feat(USB PR 3): QEMU boot smoke-test for canonical installer ISO — cascade #5 dynamic boot floor#5322
Merged
AceHack merged 1 commit intoMay 26, 2026
Conversation
…scade #5 dynamic boot floor (Kestrel ferry pointer; Aaron 2026-05-26) USB cleanup PR 3 of 3. Adds dynamic boot-time verification to the canonical AI-cluster ISO build pipeline. Catches the bug class where the ISO builds + audits pass but the kernel/initrd combination fails to actually boot (firmware mismatch; missing module; broken init; etc.). Aaron direction: "lets try to cleanup what we have in a few prs and combine get rid of the old and try to push iso testing closer into the ci instead of neading human to physically test usb but also after a few rounds i will physically test teh usb" + "you don't have to ask me direction every time you can just assume all with the simplest first". Prior art: nixos/tests/installer.nix (Kestrel 2026-05-26 ferry pointer; preserved at docs/research/2026-05-26-kestrel-runme- jit-runbook-bcl-extension-cost-of-velocity-decision-archaeology- aaron-forwarded.md via PR #5310). What lands (2 files): 1. tools/ci/qemu-boot-test.ts (new; ~150 lines) TS helper that spawns qemu-system-x86_64 with KVM acceleration (TCG fallback when KVM unavailable), captures serial console to log file, waits up to 5min for the installer's expected login prompt ("zeta-installer login:" — matches networking.hostName = "zeta-installer" in full-ai-cluster/usb-nixos-installer/nixos/ installer/configuration.nix), then kills QEMU + returns exit code. - Per Rule 0: TS-over-bash for cross-platform DST - 2GB RAM + 2 SMP cores (installer needs >= 1GB; 2GB headroom) - q35 machine type (modern PCIe; matches Beelink hardware profile better than legacy i440fx) - BIOS boot (simpler than UEFI; ISO supports both) - Exit codes: 0 success / 1 boot failure / 2 usage error 2. .github/workflows/build-ai-cluster-iso.yml extension Adds 2 new steps AFTER the existing "Audit installer ISO content" step + BEFORE "Locate ISO + capture metadata": - "Install QEMU (apt)" — apt-get install qemu-system-x86 on ubuntu-24.04 runner (~30s) - "QEMU boot smoke-test (cascade #5 — dynamic boot floor)" — invokes the TS helper against the built ISO No github.event.* interpolation in run: lines; all inputs are filesystem paths from prior steps of THIS workflow per the GitHub Actions script-injection security guide. Verification cascade now reads (post-PR-3): - Cascade #1: source-substrate audit (preflight; ~1s) - Cascade #4: ISO content audit (post-build; ~10s; verifies expected top-level files via 7z list) - Cascade #5: QEMU boot smoke-test (post-build; ~3-5min; verifies ISO actually boots to login prompt) - Locate ISO + metadata + workflow artifact upload (existing) Estimated CI time impact: +3-5min per build (QEMU boot is the slow step; KVM keeps it fast vs TCG emulation). What this is NOT (substrate-honest defer list): - NOT a full integration test (doesn't login + run commands + verify zeta-install works) — future B-NNNN follow-up - NOT a multi-arch test (x86_64 only; aarch64 ISO is a separate build path if/when needed) - NOT a hardware-specific test (UEFI variant; specific GPU configurations; etc.) — physical USB test on real Beelink fills that gap (Aaron 2026-05-26: "after a few rounds i will physically test the usb") - NOT a release-attach step (B-0830 follow-up filed in USB PR 2) This is the SIMPLEST viable boot test. Once it lands + runs across a few cycles + catches at least one real boot regression (or demonstrates none happen for N runs), Aaron's physical USB test gate fires + the test surface matures incrementally. Composes with: PR #5311 (USB cleanup PR 1); PR #5320 (USB cleanup PR 2); B-0830 (release-attach follow-up); .claude/rules/rule-0-no- sh-files (TS-over-bash discipline); .claude/rules/refresh-world- model-poll-pr-gate (authored from fresh independent clone per B-0828); substrate-check-before-worry-deployment (audit-then-act discipline applied to the new test surface). Authored from fresh independent clone at /private/tmp/zeta-clone- 2026-05-26 per Aaron's destructive-git-on-isolated-copies authorization + B-0828 multi-AI shared-checkout convention.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
5 tasks
AceHack
added a commit
that referenced
this pull request
May 26, 2026
… QEMU boot smoke-test can capture systemd/getty output via serial console (Aaron 2026-05-26) (#5324) First-cycle QEMU boot smoke-test on main (PR #5322 cascade #5) failed: ISO boots fine (GRUB + kernel + initrd loaded successfully per serial log) but timed out waiting for 'zeta-installer login:' prompt. Root cause: NixOS installer default outputs only to VGA tty1; QEMU's -display none hides VGA; serial console (-serial file:...) gets no kernel/systemd/getty output after bootloader stage. Fix: add boot.kernelParams = [ "console=ttyS0,115200n8" "console=tty1" ] to installer config so systemd/getty mirrors to serial console at standard 115200 8N1. tty1 stays primary (keyboard-attached install flow); ttyS0 is secondary for QEMU capture + real hardware with serial headers (some Beelinks; most server-class boards; debugging scenarios). Substrate-honest framing: the QEMU test isn't broken; it correctly caught that serial console wasn't configured. The ISO itself isn't broken (boots cleanly). The MISSING config (serial console) was a real gap that the new cascade #5 surfaced on first cycle — which is exactly what the test was designed to do per its commit message: 'catches the bug class where the ISO builds + audits pass but the kernel/initrd combination fails to actually boot' (or in this case, fails to surface boot output through the channels tests can observe). Composes with: PR #5322 (the QEMU boot smoke-test workflow); B-0754 iter-3 firmware substrate (similar UX-cleanliness motivation: surface less mysterious behavior); the canonical zflash + zeta-install flow (no behavioral change for the keyboard-attached install flow since tty1 stays primary). Authored from fresh independent clone per B-0828 multi-AI shared-checkout convention. Co-authored-by: Lior <lior@zeta.dev>
This was referenced May 26, 2026
AceHack
added a commit
that referenced
this pull request
May 26, 2026
…turn terminology distinction + split 3-PR-cleanup + follow-up-fix-PR correctly (#5329) Both Copilot findings verified + addressed: (1) multi-turn (overall conversation length) vs zero-turn (pathogen-decryption-protocol cost) are distinct scopes; clarified terminology in title + table-intro + empirical-generalization paragraph so readers don't read the table's Zero-turn entries as contradicting the multi-turn claim. (2) USB cleanup arc had 3-PR cleanup sequence (#5311 + #5320 + #5322) + follow-up fix (#5324) — split for narrative consistency. No semantic change; clarification only. Co-authored-by: Lior <lior@zeta.dev>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
USB cleanup PR 3 of 3. Adds dynamic boot-time verification to the canonical AI-cluster ISO build pipeline. Catches the bug class where the ISO builds + audits pass but the kernel/initrd combination fails to actually boot (firmware mismatch; missing module; broken init).
Per Aaron's direction: "push iso testing closer into the ci instead of neading human to physically test usb but also after a few rounds i will physically test teh usb".
Per Kestrel's ferry pointer (PR #5310 research doc): prior art at `nixos/tests/installer.nix`.
What lands (2 files)
1. `tools/ci/qemu-boot-test.ts` (~150 lines, Rule 0 compliant)
TS helper that spawns `qemu-system-x86_64` with KVM acceleration (TCG fallback when KVM unavailable for local testing), captures serial console to log file, waits up to 5min for the installer's expected login prompt (`zeta-installer login:` — matches `networking.hostName = "zeta-installer"` in the canonical installer config), kills QEMU, returns exit code.
2. `.github/workflows/build-ai-cluster-iso.yml` extension
Adds 2 new steps AFTER the existing "Audit installer ISO content" step + BEFORE "Locate ISO + capture metadata":
No `github.event.*` interpolation in run: lines per the GitHub Actions script-injection security guide.
Verification cascade post-PR-3
Estimated CI time impact: +3-5min per build (KVM keeps it fast vs TCG).
What this is NOT (substrate-honest defer list)
This is the SIMPLEST viable boot test. Once it lands + runs across a few cycles + catches at least one real boot regression (or demonstrates none for N runs), Aaron's physical USB test gate fires.
Composes with
Test plan