@@ -0,0 +1,114 @@
---
pr_number: 5047
title: "fix(B-0754 iter-2): empty systemd PATH broke clear+nmtui+ping+systemctl on real hardware"
author: "AceHack"
state: "MERGED"
created_at: "2026-05-26T00:10:32Z"
merged_at: "2026-05-26T00:22:03Z"
closed_at: "2026-05-26T00:22:04Z"
head_ref: "otto-cli/b0754-fix-systemd-path-iter2-2026-05-25"
base_ref: "main"
archived_at: "2026-05-27T19:46:43Z"
archive_tool: "tools/pr-preservation/archive-pr.ts"
---

# PR #5047: fix(B-0754 iter-2): empty systemd PATH broke clear+nmtui+ping+systemctl on real hardware

## PR description

## Iteration 1 result (real-hardware test, Aaron 2026-05-25)

Photo evidence on the cluster node screen after booting the v1 ISO:

- `clear: command not found` (line 40 + line 77) — the role-prompt and banner sections
- `nmtui: command not found` — when ethernet-DHCP wait expired and the wifi-fallback fired
- Drop-to-shell worked correctly — operator landed at a working root prompt with the recovery hints intact

The substrate-honest failure path was validated: the script degraded gracefully and the operator could still complete the install manually (`nmtui` + `zeta-install control-plane` from the recovery shell). But the load-bearing zero-typing-automation flow didn't reach the end. **This PR is the fix so iteration 2 completes unattended.**

## Root cause

NixOS systemd services get a **minimal PATH** by default. The first-boot script's bare commands (`clear`, `nmtui`, `ping`, `systemctl`, plus every command zeta-install.sh would reach for: `lsblk`, `sgdisk`, `mkfs.fat`, `mkfs.ext4`, `mount`, `partprobe`, etc.) all need either explicit absolute paths OR a configured Environment block on the systemd unit. The interactive-shell PATH that 'just works' for SSH or tty2 login is NOT inherited by Type=idle systemd services.

Only `clear` and `nmtui` were observed failing because `nmtui` was the first failure that stopped the flow; the remaining commands never ran on the failed path.
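One way to surface this class of failure before the flow starts is a preflight check that reports every command the unit's PATH cannot resolve. The sketch below is illustrative only: `missing_cmds` is a hypothetical helper, not something in `zeta-first-boot.sh` as merged.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (an assumption, not part of the merged script).
# Prints every command name from the argument list that cannot be resolved
# on the current PATH, one per line, and returns non-zero if any are missing.
missing_cmds() {
  local cmd rc=0
  for cmd in "$@"; do
    # `command -v` is a shell builtin, so the check itself needs no PATH.
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
      rc=1
    fi
  done
  return "$rc"
}
```

Called early in the service as, for example, `missing_cmds clear nmtui ping systemctl lsblk sgdisk partprobe`, it would list all unresolved tools in one pass instead of stopping at the first one.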

## Fix (defense in depth)

### 1. systemd unit Environment block (load-bearing)

`configuration.nix`: explicit `PATH` + `TERM` on the zeta-first-boot service. Covers every current AND future bare command:

```nix
environment = {
  PATH = "/run/current-system/sw/bin:/run/current-system/sw/sbin:/run/wrappers/bin";
  TERM = "linux";
};
```

### 2. Script-level belt-and-suspenders

`zeta-first-boot.sh`:

- Replace `clear || true` (×2) with `printf '\033c' || true`: the ANSI 'reset terminal' escape (ESC c), with no external command dependency
- Change `nmtui` invocation to `/run/current-system/sw/bin/nmtui` (absolute path)

Even if the systemd Environment is overridden by some future change, these two failure modes stay fixed.
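A minimal sketch of the two script-level guards follows, assuming bash. `reset_terminal` wraps the `printf` call described above. `resolve_tool` is a hypothetical variation that falls back to a PATH lookup when the pinned path is absent, whereas the PR simply hard-codes the absolute path.

```shell
#!/usr/bin/env bash
# Emit ESC c (terminal reset) using only the printf builtin, so the call
# does not depend on `clear` being resolvable on PATH.
reset_terminal() {
  printf '\033c'
}

# Hypothetical helper (an assumption, not in the merged script): print the
# pinned absolute path if it is executable, otherwise print the bare
# command name so the caller falls back to a normal PATH lookup.
resolve_tool() {
  local pinned="$1" name="$2"
  if [ -x "$pinned" ]; then
    printf '%s\n' "$pinned"
  else
    printf '%s\n' "$name"
  fi
}
```

A caller could then run `"$(resolve_tool /run/current-system/sw/bin/nmtui nmtui)"`. The fallback trades strictness for portability; pinning the path outright, as the PR does, fails loudly if the NixOS profile layout ever changes.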

## Composes with

- B-0759 first-time-CLI-user persona — drop-to-shell with recovery hints worked exactly as designed; the persona-aligned error path was substrate-honest
- B-0760 USB-as-repair-tool — same systemd-PATH discipline applies to every command the repair flow will invoke
- B-0761 reference architecture — this is iteration N of N for the AI-native cluster-bootstrap reference; bandwidth payoff across every future install

## Test plan

- [x] `bash -n` syntax check on edited zeta-first-boot.sh
- [ ] CI rebuilds ISO via build-ai-cluster-iso.yml (auto-triggers on `full-ai-cluster/usb-nixos-installer/**` path)
- [ ] Aaron reflashes via zflash + boots cluster node + observes unattended install completes end-to-end
- [ ] CI green

## Reviews

### COMMENTED — @copilot-pull-request-reviewer (2026-05-26T00:12:22Z)

## Pull request overview

Fixes the installer ISO’s first-boot automation failing under NixOS systemd’s minimal default `PATH`, so the unattended “zero-typing” flow can complete on real hardware.

**Changes:**
- Replaced `clear` with an ANSI terminal reset escape to remove reliance on `clear` being in `PATH`.
- Invoked `nmtui` via an absolute path to avoid `PATH`-inheritance issues.
- Set explicit `PATH` and `TERM` in the `zeta-first-boot` systemd unit to cover current and future bare command usage in the first-boot flow.

### Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| ---- | ----------- |
| full-ai-cluster/usb-nixos-installer/zeta-first-boot.sh | Removes dependency on `clear` being available and pins `nmtui` to an absolute path for robustness under systemd. |
| full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix | Adds explicit `PATH`/`TERM` to the first-boot systemd unit to ensure all invoked tools are discoverable in the unit environment. |

### COMMENTED — @copilot-pull-request-reviewer (2026-05-26T00:21:32Z)

## Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

## Review threads

### Thread 1: full-ai-cluster/usb-nixos-installer/zeta-first-boot.sh:120 (resolved)

**@copilot-pull-request-reviewer** (2026-05-26T00:12:22Z):

The comment mentions “the env-var below” as a defense, but there isn’t an environment variable being set in the script at/after this point. Consider rewording to reference the systemd unit Environment block (and/or explicitly exporting a PATH in the script if that’s what you meant) so the rationale stays accurate for future readers.

## General comments

### @chatgpt-codex-connector (2026-05-26T00:19:05Z)

You have reached your Codex usage limits for code reviews. You can see your limits in the [Codex usage dashboard](https://chatgpt.com/codex/cloud/settings/usage).

### @AceHack (2026-05-26T00:19:11Z)

Fixed in `fd160ff1`: reworded the comment to reference the systemd unit's `environment.PATH` override (set in `configuration.nix` on `systemd.services.zeta-first-boot.environment.PATH` via `lib.mkForce`) rather than implying an env-var is set in the shell script. Thanks for the catch.
@@ -0,0 +1,60 @@
---
pr_number: 5048
title: "backlog(B-0762): AI auto-submit-back telemetry + fixes from in-the-wild installs \u2014 adoption-cost-to-zero flywheel"
author: "AceHack"
state: "MERGED"
created_at: "2026-05-26T00:12:34Z"
merged_at: "2026-05-26T00:14:55Z"
closed_at: "2026-05-26T00:14:55Z"
head_ref: "otto-cli/b0762-ai-auto-submit-back-telemetry-flywheel-2026-05-25"
base_ref: "main"
archived_at: "2026-05-27T19:46:42Z"
archive_tool: "tools/pr-preservation/archive-pr.ts"
---

# PR #5048: backlog(B-0762): AI auto-submit-back telemetry + fixes from in-the-wild installs — adoption-cost-to-zero flywheel

## PR description

Aaron 2026-05-25 named the adoption-scaling mechanism: each in-the-wild install opt-in submits success/failure telemetry; failures trigger LLM-generated minimal-diff PRs; pattern-matched fixes auto-merge; next operator on similar hardware pays 0 cost. Network-effect substrate; ARC-AGI parallel (B-0761) provides the training data + benchmark catalog. Composes with B-0743 / B-0754-B-0761.

## Reviews

### COMMENTED — @copilot-pull-request-reviewer (2026-05-26T00:15:24Z)

## Pull request overview

Adds a new P2 backlog row (B-0762) describing an install-time opt-in telemetry loop that can auto-generate minimal-diff fix PRs from in-the-wild failures, aiming to drive “adoption cost → 0” via a network-effect feedback cycle.

**Changes:**
- Added backlog row **B-0762** capturing the telemetry + auto-fix flywheel concept, acceptance criteria, and privacy/security floor.
- Updated `docs/BACKLOG.md` index to include **B-0762** under P2.

### Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| ---- | ----------- |
| docs/backlog/P2/B-0762-ai-auto-submit-back-telemetry-fixes-from-in-the-wild-installs-adoption-cost-to-zero-flywheel-aaron-2026-05-25.md | New backlog row defining the telemetry→LLM PR→auto-merge gating loop and acceptance criteria. |
| docs/BACKLOG.md | Adds B-0762 to the generated backlog index. |

## Review threads

### Thread 1: docs/backlog/P2/B-0762-ai-auto-submit-back-telemetry-fixes-from-in-the-wild-installs-adoption-cost-to-zero-flywheel-aaron-2026-05-25.md:14 (unresolved)

**@copilot-pull-request-reviewer** (2026-05-26T00:15:24Z):

`composes_with` references `B-0743`, but there is no corresponding backlog row (`id: B-0743`) anywhere in the repo. Either add the missing backlog row (or land it in the same PR/stack) or drop this reference to avoid dangling IDs.

### Thread 2: docs/backlog/P2/B-0762-ai-auto-submit-back-telemetry-fixes-from-in-the-wild-installs-adoption-cost-to-zero-flywheel-aaron-2026-05-25.md:19 (unresolved)

**@copilot-pull-request-reviewer** (2026-05-26T00:15:24Z):

`composes_with` includes `B-0758` and `B-0760`, but there are no corresponding backlog rows (`id: B-0758` / `id: B-0760`) in the repo. Add those rows (or update to the correct existing IDs) so backlog cross-references stay resolvable.

## General comments

### @chatgpt-codex-connector (2026-05-26T00:12:38Z)

You have reached your Codex usage limits for code reviews. You can see your limits in the [Codex usage dashboard](https://chatgpt.com/codex/cloud/settings/usage).
@@ -0,0 +1,27 @@
---
pr_number: 5050
title: "backlog(B-0764): CNCF ecosystem as force multipliers behind Zeta interfaces (KEDA, DAPR, OPA, OAM/KubeVela)"
author: "AceHack"
state: "MERGED"
created_at: "2026-05-26T00:16:48Z"
merged_at: "2026-05-26T00:18:54Z"
closed_at: "2026-05-26T00:18:54Z"
head_ref: "otto-cli/b0764-cncf-ecosystem-force-multipliers-2026-05-25"
base_ref: "main"
archived_at: "2026-05-27T19:46:41Z"
archive_tool: "tools/pr-preservation/archive-pr.ts"
---

# PR #5050: backlog(B-0764): CNCF ecosystem as force multipliers behind Zeta interfaces (KEDA, DAPR, OPA, OAM/KubeVela)

## PR description

Aaron 2026-05-25 named the CNCF-ecosystem-as-force-multipliers pattern: adopt KEDA + DAPR + OPA + OAM/KubeVela + Crossplane + kro + Cilium + Knative + OpenTelemetry + Rook/Ceph as plugins behind Zeta interfaces. Composed via Ace (PM-of-PMs) + ontology negotiation (B-0741). Zeta owns the stable operator contract; CNCF projects ship the substrate underneath; operator gets ecosystem momentum + maintenance distribution for free.

Disambiguates OPA (Open Policy Agent / Rego) vs OAM (Open Application Model). Composes with B-0741 / B-0747 / B-0748 / B-0749 / B-0761 / B-0762 / B-0763.

## General comments

### @chatgpt-codex-connector (2026-05-26T00:16:51Z)

You have reached your Codex usage limits for code reviews. You can see your limits in the [Codex usage dashboard](https://chatgpt.com/codex/cloud/settings/usage).
@@ -0,0 +1,27 @@
---
pr_number: 5051
title: "backlog(B-0765 P1): ServiceTitan route \u2014 plug into existing control interfaces; ontology negotiation at standards layer"
author: "AceHack"
state: "MERGED"
created_at: "2026-05-26T00:20:16Z"
merged_at: "2026-05-26T00:21:49Z"
closed_at: "2026-05-26T00:21:49Z"
head_ref: "otto-cli/b0765-service-titan-route-standards-layer-2026-05-25"
base_ref: "main"
archived_at: "2026-05-27T19:46:41Z"
archive_tool: "tools/pr-preservation/archive-pr.ts"
---

# PR #5051: backlog(B-0765 P1): ServiceTitan route — plug into existing control interfaces; ontology negotiation at standards layer

## PR description

Aaron 2026-05-25 named the ServiceTitan strategic principle as substrate-engineering policy: every cluster-install substrate decision filters through 'are we inventing or adopting?'. Prefer adopting existing standards (k8s CRDs, OAM Components, Crossplane Compositions, Helm 3 OCI, ArgoCD, Flux, OpenTelemetry, OPA Rego, DAPR Components, NixOS) over inventing parallel substrate. Sharpens (not retracts) B-0763 + B-0764. Ontology negotiation at the STANDARDS LAYER (not per-project) is the load-bearing leverage point.

P1 because it's the strategic filter shape every future cluster-install row should pass through. Composes with B-0741 / B-0744 / B-0747 / B-0748 / B-0749 / B-0754 / B-0759 / B-0761 / B-0762 / B-0763 / B-0764.

## General comments

### @chatgpt-codex-connector (2026-05-26T00:20:21Z)

You have reached your Codex usage limits for code reviews. You can see your limits in the [Codex usage dashboard](https://chatgpt.com/codex/cloud/settings/usage).
@@ -0,0 +1,27 @@
---
pr_number: 5053
title: "backlog(B-0767 P1): Zeta-native scheduler first \u2014 DST + AI-aware cluster management"
author: "AceHack"
state: "MERGED"
created_at: "2026-05-26T00:23:39Z"
merged_at: "2026-05-26T00:25:44Z"
closed_at: "2026-05-26T00:25:44Z"
head_ref: "otto-cli/b0767-zeta-native-scheduler-first-dst-ai-aware-2026-05-25"
base_ref: "main"
archived_at: "2026-05-27T19:46:40Z"
archive_tool: "tools/pr-preservation/archive-pr.ts"
---

# PR #5053: backlog(B-0767 P1): Zeta-native scheduler first — DST + AI-aware cluster management

## PR description

Aaron 2026-05-25 sequencing call on B-0766 wave order: scheduler is load-bearing enough on (1) DST grounding for the whole cluster + (2) AI-aware scheduling (GPU topology, model locality, workload class, energy cost) that it should be Wave 1, not Wave 2. Binary-compatible via Pod.spec.schedulerName: zeta-scheduler (per B-0765 ServiceTitan route). Sub-waves A-E from baseline through DBSP+Bayesian + multi-objective.

First concrete sub-row of B-0766. Composes with B-0428 (F# fork) + B-0741 / B-0747 / B-0754 / B-0761 / B-0762 / B-0763 / B-0764 / B-0765 / B-0766.

## General comments

### @chatgpt-codex-connector (2026-05-26T00:23:43Z)

You have reached your Codex usage limits for code reviews. You can see your limits in the [Codex usage dashboard](https://chatgpt.com/codex/cloud/settings/usage).
@@ -0,0 +1,31 @@
---
pr_number: 5055
title: "backlog(B-0769 P1): VC meta-playbook \u2014 control-structure injection around capital flow + adj-EBITDA financial-engineering layer (substrate-honest variant)"
author: "AceHack"
state: "MERGED"
created_at: "2026-05-26T00:27:03Z"
merged_at: "2026-05-26T00:28:52Z"
closed_at: "2026-05-26T00:28:52Z"
head_ref: "otto-cli/b0769-vc-meta-playbook-control-structure-injection-substrate-honest-2026-05-25"
base_ref: "main"
archived_at: "2026-05-27T19:46:39Z"
archive_tool: "tools/pr-preservation/archive-pr.ts"
---

# PR #5055: backlog(B-0769 P1): VC meta-playbook — control-structure injection around capital flow + adj-EBITDA financial-engineering layer (substrate-honest variant)

## PR description

Aaron 2026-05-25 named the meta-strategy under B-0765 + B-0768: VC meta-playbook (control-structure injection around capital flow in verticals) IS the playbook ServiceTitan executed. Plus sharpening 'with adjusted ebita for specific investor profiles'.

Substrate-honest variant: same control-structure position; opposite value-flow direction. Operator keeps value (open reference per B-0761, vendor-swap per B-0763, opt-in open telemetry per B-0762, binary-compat per B-0766). Failure-mode table + ethical-floor rules guard against extractive drift.

Financial-engineering layer: adjusted EBITDA shaped per investor profile (seed/A → growth equity → pre-IPO → PE → strategic → public markets); substrate-honest variant = TRUE adjustments + single GAAP→non-GAAP bridge + auditor-survivable. ServiceTitan empirical anchor.

Composes with B-0741 / B-0754 / B-0759 / B-0761-B-0768 + 5 ethical-floor rules.

## General comments

### @chatgpt-codex-connector (2026-05-26T00:27:07Z)

You have reached your Codex usage limits for code reviews. You can see your limits in the [Codex usage dashboard](https://chatgpt.com/codex/cloud/settings/usage).
@@ -0,0 +1,88 @@
---
pr_number: 5057
title: "fix(B-0754 iter-3): per-device partprobe \u2014 bare partprobe was hitting the boot USB (/dev/sda) + bailing"
author: "AceHack"
state: "MERGED"
created_at: "2026-05-26T00:54:23Z"
merged_at: "2026-05-26T01:43:12Z"
closed_at: "2026-05-26T01:43:12Z"
head_ref: "otto-cli/b0754-iter3-partprobe-per-device-2026-05-25"
base_ref: "main"
archived_at: "2026-05-27T19:46:38Z"
archive_tool: "tools/pr-preservation/archive-pr.ts"
---

# PR #5057: fix(B-0754 iter-3): per-device partprobe — bare partprobe was hitting the boot USB (/dev/sda) + bailing

## PR description

## Iteration 2 result (cluster node 1, real-hardware test, Aaron 2026-05-25)

Photo evidence on PC 1 shows iter-2 reached 98% of the install path on first try:

- Wifi connected
- Banner shown
- Greedy N-disk enum: both Crucial CT1000P3PSSD8 NVMes correctly identified with serials
- Plan presented (BOOT nvme0n1: ESP 1G + root 256G + longhorn1 rest; DATA nvme1n1: whole disk longhorn2)
- ZETA_AUTO_CONFIRM=WIPE bypass: WORKED
- wipefs + sgdisk on both NVMes: SUCCESS
- GPT partition creation on both NVMes: SUCCESS
- **Then**: `Error: Partition(s) 1 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use`
- drop_to_shell fired correctly with recovery hints

## Root cause

zeta-install.sh called bare `sudo partprobe` (no args). partprobe with no args probes EVERY block device the kernel knows about. Linux commonly exposes USB mass-storage as `/dev/sda` when no SATA disks are present. The booted live ISO has mounted partitions on /dev/sda, so partprobe (rightfully) refuses to refresh that device and returns non-zero, and `set -euo pipefail` bails.

The greedy N-disk enum ALREADY correctly excluded the USB (TRAN=usb filter). We never partitioned /dev/sda. The partprobe call was the only blanket-all-devices invocation in the whole script.
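The transport filter mentioned above can be sketched as a small function over `lsblk` output. This is an assumption about its shape, not the actual enumeration code in `zeta-install.sh`; `non_usb_disks` is a hypothetical name, and it reads rows on stdin so it can be exercised without real hardware.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a TRAN=usb exclusion filter (not the merged code).
# Reads "NAME TYPE TRAN" rows on stdin, as produced by
#   lsblk -dn -o NAME,TYPE,TRAN
# and prints the /dev path of every whole disk whose transport is not usb.
# Rows with an empty TRAN column (e.g. loop devices) are dropped by the
# TYPE check rather than the transport check.
non_usb_disks() {
  awk '$2 == "disk" && $3 != "usb" { print "/dev/" $1 }'
}
```

On the node described here, `lsblk -dn -o NAME,TYPE,TRAN | non_usb_disks` would be expected to keep the two NVMe devices and drop the boot USB, whatever `/dev/sdX` letter it was assigned.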

## Fix

Per-device partprobe on BOOT_DISK + each DATA_DISKS entry. Never blanket. Never touch /dev/sda.

Aaron 2026-05-25: 'i would rather do it right so it's not ambigious for future me / users' — the script now operates on explicit-target devices throughout, no blanket-system-wide invocations remaining.
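The per-device approach might look like the sketch below. It is a hedged illustration rather than the merged code: `refresh_partition_tables` and the overridable `PARTPROBE` variable are hypothetical names, `sudo` is omitted, and the explicit failure message follows the suggestion raised in the review thread.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of per-device partprobe (names are assumptions).
# Refreshes the kernel's view of the partition table only for the disks
# passed as arguments; never invoked without an explicit device.
# PARTPROBE can be overridden (e.g. PARTPROBE=echo) for a dry run.
refresh_partition_tables() {
  local disk
  for disk in "$@"; do
    if ! "${PARTPROBE:-partprobe}" "$disk"; then
      # Name the failing disk so the abort message is actionable.
      printf 'partprobe failed for %s\n' "$disk" >&2
      return 1
    fi
  done
}
```

Assuming `DATA_DISKS` is a bash array, the call site would be `refresh_partition_tables "$BOOT_DISK" "${DATA_DISKS[@]}"`, which leaves the boot USB out of the probe entirely.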

## Test plan

- [x] `bash -n` syntax check
- [ ] CI rebuilds ISO via build-ai-cluster-iso.yml
- [ ] Aaron reflashes via zflash + boots cluster node 1 + observes unattended install reaches end-to-end (cluster member after reboot)
- [ ] CI green

## Reviews

### COMMENTED — @copilot-pull-request-reviewer (2026-05-26T00:56:30Z)

## Pull request overview

This PR updates the AI-cluster USB NixOS installer script to avoid calling `partprobe` without arguments (which probes every block device, including the live-boot USB), and instead refreshes the kernel partition table only for the disks that were just partitioned.

**Changes:**
- Replace blanket `sudo partprobe` with per-device `partprobe` calls for `BOOT_DISK` and each `DATA_DISKS` entry.
- Add inline rationale documenting why blanket probing is unsafe in this installer context.

### COMMENTED — @copilot-pull-request-reviewer (2026-05-26T01:11:51Z)

## Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

## Review threads

### Thread 1: full-ai-cluster/usb-nixos-installer/zeta-install.sh:171 (resolved)

**@copilot-pull-request-reviewer** (2026-05-26T00:56:29Z):

The comment says the kernel exposes the boot USB as `/dev/sda`, but the assigned `/dev/sdX` letter isn't guaranteed across hardware/boot order. Rewording this to something like “often /dev/sdX (commonly /dev/sda)” would keep the rationale accurate.

### Thread 2: full-ai-cluster/usb-nixos-installer/zeta-install.sh:175 (resolved)

**@copilot-pull-request-reviewer** (2026-05-26T00:56:30Z):

With `set -euo pipefail`, a `partprobe` failure here will abort the install. Adding an explicit failure handler (e.g., `... || bail "partprobe failed for $disk ..."`) would make the abort message more immediately actionable by identifying the disk and suggesting next steps.

## General comments

### @chatgpt-codex-connector (2026-05-26T00:54:27Z)

You have reached your Codex usage limits for code reviews. You can see your limits in the [Codex usage dashboard](https://chatgpt.com/codex/cloud/settings/usage).