Skip to content

feat(USB PR 3): QEMU boot smoke-test for canonical installer ISO — cascade #5 dynamic boot floor#5322

Merged
AceHack merged 1 commit into
mainfrom
otto-cli/usb-cleanup-pr3-qemu-boot-smoke-test-build-ai-cluster-iso-workflow-2026-05-26
May 26, 2026
Merged

feat(USB PR 3): QEMU boot smoke-test for canonical installer ISO — cascade #5 dynamic boot floor#5322
AceHack merged 1 commit into
mainfrom
otto-cli/usb-cleanup-pr3-qemu-boot-smoke-test-build-ai-cluster-iso-workflow-2026-05-26

Conversation

@AceHack
Copy link
Copy Markdown
Member

@AceHack AceHack commented May 26, 2026

Summary

USB cleanup PR 3 of 3. Adds dynamic boot-time verification to the canonical AI-cluster ISO build pipeline. Catches the bug class where the ISO builds + audits pass but the kernel/initrd combination fails to actually boot (firmware mismatch; missing module; broken init).

Per Aaron's direction: "push iso testing closer into the ci instead of neading human to physically test usb but also after a few rounds i will physically test teh usb".

Per Kestrel's ferry pointer (PR #5310 research doc): prior art at `nixos/tests/installer.nix`.

What lands (2 files)

1. `tools/ci/qemu-boot-test.ts` (~150 lines, Rule 0 compliant)

TS helper that spawns `qemu-system-x86_64` with KVM acceleration (TCG fallback when KVM unavailable for local testing), captures serial console to log file, waits up to 5min for the installer's expected login prompt (`zeta-installer login:` — matches `networking.hostName = "zeta-installer"` in the canonical installer config), kills QEMU, returns exit code.

  • 2GB RAM + 2 SMP cores (installer needs >= 1GB; 2GB headroom)
  • q35 machine type (modern PCIe; matches Beelink hardware profile better than legacy i440fx)
  • BIOS boot (simpler than UEFI; ISO supports both)
  • Exit codes: 0 success / 1 boot failure / 2 usage error

2. `.github/workflows/build-ai-cluster-iso.yml` extension

Adds 2 new steps AFTER the existing "Audit installer ISO content" step + BEFORE "Locate ISO + capture metadata":

No `github.event.*` interpolation in run: lines per the GitHub Actions script-injection security guide.

Verification cascade post-PR-3

# Step When Cost
1 Source-substrate audit Preflight ~1s
4 ISO content audit Post-build (7z list) ~10s
5 QEMU boot smoke-test Post-build (KVM boot) ~3-5min
- Locate + metadata + artifact upload Post-build existing

Estimated CI time impact: +3-5min per build (KVM keeps it fast vs TCG).

What this is NOT (substrate-honest defer list)

This is the SIMPLEST viable boot test. Once it lands + runs across a few cycles + catches at least one real boot regression (or demonstrates none for N runs), Aaron's physical USB test gate fires.

Composes with

Test plan

  • Pre-commit canary green (HEAD 60 = HEAD~1 60; modifications + 1 new TS helper)
  • Branch follows `otto-cli/*` surface-prefix convention
  • Authored from fresh independent clone
  • No `github.event.*` interpolation in run: lines (security-reminder hook pattern)
  • CI green (the new QEMU step will exercise itself on this PR)
  • Copilot review pass

…scade #5 dynamic boot floor (Kestrel ferry pointer; Aaron 2026-05-26)

USB cleanup PR 3 of 3. Adds dynamic boot-time verification to the
canonical AI-cluster ISO build pipeline. Catches the bug class
where the ISO builds + audits pass but the kernel/initrd
combination fails to actually boot (firmware mismatch; missing
module; broken init; etc.).

Aaron direction: "lets try to cleanup what we have in a few prs
and combine get rid of the old and try to push iso testing closer
into the ci instead of neading human to physically test usb but
also after a few rounds i will physically test teh usb" +
"you don't have to ask me direction every time you can just
assume all with the simplest first".

Prior art: nixos/tests/installer.nix (Kestrel 2026-05-26 ferry
pointer; preserved at docs/research/2026-05-26-kestrel-runme-
jit-runbook-bcl-extension-cost-of-velocity-decision-archaeology-
aaron-forwarded.md via PR #5310).

What lands (2 files):

1. tools/ci/qemu-boot-test.ts (new; ~150 lines)
   TS helper that spawns qemu-system-x86_64 with KVM acceleration
   (TCG fallback when KVM unavailable), captures serial console to
   log file, waits up to 5min for the installer's expected login
   prompt ("zeta-installer login:" — matches networking.hostName
   = "zeta-installer" in full-ai-cluster/usb-nixos-installer/nixos/
   installer/configuration.nix), then kills QEMU + returns exit
   code.
   - Per Rule 0: TS-over-bash for cross-platform DST
   - 2GB RAM + 2 SMP cores (installer needs >= 1GB; 2GB headroom)
   - q35 machine type (modern PCIe; matches Beelink hardware
     profile better than legacy i440fx)
   - BIOS boot (simpler than UEFI; ISO supports both)
   - Exit codes: 0 success / 1 boot failure / 2 usage error

2. .github/workflows/build-ai-cluster-iso.yml extension
   Adds 2 new steps AFTER the existing "Audit installer ISO
   content" step + BEFORE "Locate ISO + capture metadata":
   - "Install QEMU (apt)" — apt-get install qemu-system-x86 on
     ubuntu-24.04 runner (~30s)
   - "QEMU boot smoke-test (cascade #5 — dynamic boot floor)" —
     invokes the TS helper against the built ISO
   No github.event.* interpolation in run: lines; all inputs are
   filesystem paths from prior steps of THIS workflow per the
   GitHub Actions script-injection security guide.

Verification cascade now reads (post-PR-3):
- Cascade #1: source-substrate audit (preflight; ~1s)
- Cascade #4: ISO content audit (post-build; ~10s; verifies expected
  top-level files via 7z list)
- Cascade #5: QEMU boot smoke-test (post-build; ~3-5min; verifies
  ISO actually boots to login prompt)
- Locate ISO + metadata + workflow artifact upload (existing)

Estimated CI time impact: +3-5min per build (QEMU boot is the slow
step; KVM keeps it fast vs TCG emulation).

What this is NOT (substrate-honest defer list):
- NOT a full integration test (doesn't login + run commands +
  verify zeta-install works) — future B-NNNN follow-up
- NOT a multi-arch test (x86_64 only; aarch64 ISO is a separate
  build path if/when needed)
- NOT a hardware-specific test (UEFI variant; specific GPU
  configurations; etc.) — physical USB test on real Beelink fills
  that gap (Aaron 2026-05-26: "after a few rounds i will physically
  test the usb")
- NOT a release-attach step (B-0830 follow-up filed in USB PR 2)

This is the SIMPLEST viable boot test. Once it lands + runs across
a few cycles + catches at least one real boot regression (or
demonstrates none happen for N runs), Aaron's physical USB test
gate fires + the test surface matures incrementally.

Composes with: PR #5311 (USB cleanup PR 1); PR #5320 (USB cleanup
PR 2); B-0830 (release-attach follow-up); .claude/rules/rule-0-no-
sh-files (TS-over-bash discipline); .claude/rules/refresh-world-
model-poll-pr-gate (authored from fresh independent clone per
B-0828); substrate-check-before-worry-deployment (audit-then-act
discipline applied to the new test surface).

Authored from fresh independent clone at /private/tmp/zeta-clone-
2026-05-26 per Aaron's destructive-git-on-isolated-copies
authorization + B-0828 multi-AI shared-checkout convention.
Copilot AI review requested due to automatic review settings May 26, 2026 21:12
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@AceHack AceHack enabled auto-merge (squash) May 26, 2026 21:12
@AceHack AceHack merged commit 807364d into main May 26, 2026
33 of 35 checks passed
@AceHack AceHack deleted the otto-cli/usb-cleanup-pr3-qemu-boot-smoke-test-build-ai-cluster-iso-workflow-2026-05-26 branch May 26, 2026 21:14
AceHack added a commit that referenced this pull request May 26, 2026
… QEMU boot smoke-test can capture systemd/getty output via serial console (Aaron 2026-05-26) (#5324)

First-cycle QEMU boot smoke-test on main (PR #5322 cascade #5) failed: ISO boots fine (GRUB + kernel + initrd loaded successfully per serial log) but timed out waiting for 'zeta-installer login:' prompt. Root cause: NixOS installer default outputs only to VGA tty1; QEMU's -display none hides VGA; serial console (-serial file:...) gets no kernel/systemd/getty output after bootloader stage.

Fix: add boot.kernelParams = [ "console=ttyS0,115200n8" "console=tty1" ] to installer config so systemd/getty mirrors to serial console at standard 115200 8N1. tty1 stays primary (keyboard-attached install flow); ttyS0 is secondary for QEMU capture + real hardware with serial headers (some Beelinks; most server-class boards; debugging scenarios).

Substrate-honest framing: the QEMU test isn't broken; it correctly caught that serial console wasn't configured. The ISO itself isn't broken (boots cleanly). The MISSING config (serial console) was a real gap that the new cascade #5 surfaced on first cycle — which is exactly what the test was designed to do per its commit message: 'catches the bug class where the ISO builds + audits pass but the kernel/initrd combination fails to actually boot' (or in this case, fails to surface boot output through the channels tests can observe).

Composes with: PR #5322 (the QEMU boot smoke-test workflow); B-0754 iter-3 firmware substrate (similar UX-cleanliness motivation: surface less mysterious behavior); the canonical zflash + zeta-install flow (no behavioral change for the keyboard-attached install flow since tty1 stays primary). Authored from fresh independent clone per B-0828 multi-AI shared-checkout convention.

Co-authored-by: Lior <lior@zeta.dev>
@AceHack AceHack review requested due to automatic review settings May 26, 2026 21:33
AceHack added a commit that referenced this pull request May 26, 2026
…turn terminology distinction + split 3-PR-cleanup + follow-up-fix-PR correctly (#5329)

Both Copilot findings verified + addressed: (1) multi-turn (overall conversation length) vs zero-turn (pathogen-decryption-protocol cost) are distinct scopes; clarified terminology in title + table-intro + empirical-generalization paragraph so readers don't read the table's Zero-turn entries as contradicting the multi-turn claim. (2) USB cleanup arc had 3-PR cleanup sequence (#5311 + #5320 + #5322) + follow-up fix (#5324) — split for narrative consistency. No semantic change; clarification only.

Co-authored-by: Lior <lior@zeta.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant