scripts/boot-image: fix cross-arch smoke tests #2103

Merged
supakeen merged 13 commits into osbuild:main from lzap:cross-fix-1
Jan 6, 2026

Conversation

@lzap
Contributor

@lzap lzap commented Dec 19, 2025

Cross-arch testing does not work properly: the QEMU command line renders the image twice, once as a device option and once as a positional argument. On top of that, the arguments are incorrectly formatted because the repr of the Path objects gets printed:

PosixPath('/var/tmp/tmpdk5gg6b6/netinst.iso'), PosixPath('/var/tmp/tmpp40bxujp/disk.img')

QEMU complains that the format of the raw image cannot be auto-detected because the image is empty. For the installer ISO we create an empty raw image, and since it has no content, QEMU cannot detect its format. On top of that, the format= argument is passed as qcow2, which is not correct in this case.
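A minimal sketch of the intended command-line shape, assuming a hypothetical helper named `drive_args` (the real code in vmtest and scripts/boot-image may look different). The option strings mirror the QEMU invocation logged further down in this conversation; the point is only that the image is attached once and the format is stated explicitly instead of being left to auto-detection.

```python
from pathlib import PurePosixPath


def drive_args(image, fmt):
    """Return QEMU arguments that attach `image` exactly once.

    `drive_args` is a hypothetical helper for illustration. The format is
    always explicit, so an empty raw disk never has to be probed.
    """
    if fmt not in ("raw", "qcow2"):
        raise ValueError(f"unsupported image format: {fmt}")
    return [
        "-device", "virtio-scsi-pci,id=scsi",
        "-device", "scsi-hd,drive=disk0",
        # The f-string converts the path via str(), never via repr().
        "-drive", f"file={image},if=none,id=disk0,cache=unsafe,format={fmt}",
    ]


# Example path is made up; an empty installer target disk would use "raw".
args = drive_args(PurePosixPath("/var/tmp/example/disk.img"), "raw")
```

Note that the image path does not appear a second time as a trailing positional argument.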

As part of this PR, I have also modernized the CPU configuration a little bit, which was required in order to use virtio. On my system with a recent Intel CPU I was getting some panics, so I had to enable the Q35 machine model and stop passing -cpu host, which exposed some very modern features that did not work well. Instead, I selected a relatively modern model that provides the x86_64-v3 feature set required by RHEL 10, which makes it more stable across the various laptops and servers we test on locally.

There are also some smaller changes like cache=unsafe or -cpu max for ARM; see the individual commits.

I am also pulling in the terraform update, which should bring new instances; hopefully this makes things a bit faster at times.

@lzap lzap force-pushed the cross-fix-1 branch 2 times, most recently from ebf115e to 206c8a8 Compare December 20, 2025 21:10
@supakeen
Member

supakeen commented Dec 21, 2025

Look at that, it shows that something is indeed a PosixPath. FWIW, I vaguely recall that subprocess might turn its args into str, or at least do so for path-like objects, but I'm not entirely sure, so you might be chasing a red herring where the print shows the repr but not the str?

  File "/usr/local/lib/python3.14/site-packages/vmtest/vm.py", line 246, in _gen_qemu_cmdline
    print("QEMU: " + " ".join(qemu_cmdline))
                     ~~~~~~~~^^^^^^^^^^^^^^
TypeError: sequence item 27: expected str instance, PosixPath found

@lzap
Contributor Author

lzap commented Dec 21, 2025

Yeah, found it. It was not only the cdrom path, but also the memory size (an integer) and possibly the extra arguments, which can be freely passed. Squashed it into the "cdrom" commit.
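To illustrate the failure and one possible fix: `str.join` only accepts strings, so a single Path or int in the list raises the TypeError shown in the traceback above. The sketch below uses a hypothetical helper `stringify_cmdline` and made-up example values; it is not the actual change in vmtest.

```python
from pathlib import PurePosixPath


def stringify_cmdline(args):
    """Convert every argument (paths, integers, ...) to str.

    Hypothetical helper: str() on a path yields the plain path, whereas
    repr() would yield PurePosixPath('...').
    """
    return [str(arg) for arg in args]


# Mixed types similar to the ones described above: an int and a path.
cmdline = [
    "qemu-system-x86_64",
    "-m", 3000,
    "-cdrom", PurePosixPath("/var/tmp/example/netinst.iso"),
]

try:
    " ".join(cmdline)  # fails: the list contains non-str items
    join_failed = False
except TypeError:
    join_failed = True

printable = " ".join(stringify_cmdline(cmdline))
```

Converting once, before both printing and launching, keeps the debug output identical to what is actually executed.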

I also found out that the CPU model I set was too modern and some hosts did not have it, so I am trying the "oldest" possible model that has the x86_64-v3 features: Haswell-v4. Let's see how that works.

Finally, I was missing the most important fix in the last rebase: removing self._img as the last argument. The disk image was actually passed twice, which left QEMU unable to detect the format of the empty raw images we use for ISO installers.

Let's see how much green I can get this time :-)

@supakeen
Member

> Finally, I was missing the most important fix in the last rebase: removing the self._img as the last argument. The disk image was actually passed twice causing the QEMU not being able to detect format for empty raw images we use for ISO installers.

Seems a lot greener already!

supakeen
supakeen previously approved these changes Dec 21, 2025
@lzap
Contributor Author

lzap commented Dec 21, 2025

Still a couple of problems. PXE-TAR on Intel times out; this is weird, there is no output.

Testing image at build/fedora_42-x86_64-pxe_tar_xz-empty/xz/pxe.tar.xz
QEMU: qemu-system-x86_64 -M q35,accel=kvm -cpu Haswell-v4 -device virtio-scsi-pci,id=scsi -device scsi-hd,drive=disk0 -m 3000 -serial stdio -monitor none -device virtio-net-pci,netdev=net.0,id=net.0 -netdev user,id=net.0,hostfwd=tcp::45923-:22 -qmp unix:/tmp/vmtest-2w2uya6_-disk.img/qmp.socket,server,nowait -drive file=/var/tmp/tmpzya6dkc6/disk.img,if=none,id=disk0,cache=unsafe,format=raw -nographic -kernel /var/tmp/tmpzya6dkc6/vmlinuz -initrd /var/tmp/tmpzya6dkc6/combined.img -append rd.live.image root=live:/rootfs.img console=ttyS0 systemd.debug-shell=ttyS0 systemd.mask=serial-getty@ttyS0.service systemd.unit=reboot.target
qemu-system-x86_64: Could not access KVM kernel module: No such file or directory
qemu-system-x86_64: failed to initialize kvm: No such file or directory
WARNING: step_script could not run to completion because the timeout was exceeded. For more control over job and script timeouts see: https://docs.gitlab.com/ci/runners/configure_runners/#set-script-and-after_script-timeouts
ERROR: Job failed: execution took longer than 2h30m0s seconds

When I try this locally (with RHEL 10) I get console output just fine and everything works, but then the reset event is somehow not caught and it gets stuck forever:

[    6.645133] systemd-shutdown[1]: Rebooting.
[    6.646809] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[    6.650588] ACPI: PM: Preparing to enter system sleep state S5
[    6.652404] reboot: Restarting system
[    6.653450] reboot: machine restart
DEBUG: got event {'timestamp': {'seconds': 1766345911, 'microseconds': 620784}, 'event': 'RESET', 'data': {'guest': True, 'reason': 'guest-reset'}}
qmp event RESET

When I cancel, I do see a problem, possibly with the HTTP server; maybe this is the reason why the PXE-TAR tests hang:

Traceback (most recent call last):
  File "/home/lzap/images/test/scripts/boot-image", line 594, in <module>
    main()
    ~~~~^^
  File "/home/lzap/images/test/scripts/boot-image", line 560, in main
    boot_qemu_pxe(arch, image_path)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/home/lzap/images/test/scripts/boot-image", line 332, in boot_qemu_pxe
    with subprocess.Popen(
         ~~~~~~~~~~~~~~~~^
        ["python3", "-m", "http.server", f"{http_port}"],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
        stderr=subprocess.DEVNULL
        ^^^^^^^^^^^^^^^^^^^^^^^^^
    ):
    ^
  File "/usr/lib64/python3.14/subprocess.py", line 1129, in __exit__
    self.wait()
    ~~~~~~~~~^^
  File "/usr/lib64/python3.14/subprocess.py", line 1278, in wait
    return self._wait(timeout=timeout)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.14/subprocess.py", line 2064, in _wait
    (pid, sts) = self._try_wait(0)
                 ~~~~~~~~~~~~~~^^^
  File "/usr/lib64/python3.14/subprocess.py", line 2022, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
                 ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

I remember it was me who refactored the HTTP server from try/catch to with. If I put that back, it works locally. So I am adding another commit that restores the old behavior with explicit termination, calling terminate() so that the HTTP server process cannot hang the test.
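The traceback above shows why the `with` form hangs: `Popen.__exit__` only waits for the child, it never terminates it, so a long-running server blocks forever. A minimal sketch of explicit termination, assuming a hypothetical wrapper named `terminating` (the actual commit may simply use try/finally inline):

```python
import contextlib
import subprocess
import sys


@contextlib.contextmanager
def terminating(cmd, **popen_kwargs):
    """Run `cmd` and make sure it is terminated when the block exits."""
    proc = subprocess.Popen(cmd, **popen_kwargs)
    try:
        yield proc
    finally:
        proc.terminate()  # ask the child to stop (SIGTERM on POSIX)
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()  # escalate if the child ignores the request
            proc.wait()


# Stand-in for the HTTP server: a child that would otherwise run for 60 s.
with terminating([sys.executable, "-c", "import time; time.sleep(60)"]) as server:
    pass  # the boot test would run here

exit_status = server.returncode
```

With a plain `with subprocess.Popen(...)`, the same example would block for the full 60 seconds on exit.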

And then many aarch64 tests fail because the "max" CPU type is problematic. I am putting back the Cortex type, which is safer, but I still want to try the tcg option that is supposed to speed up emulation on multi-core instances. Actually, instead of the hardcoded 2, I am adding code that detects the number of cores, since we run these boot tests on EC2 and in the future we might have more than 2 cores (currently we only have 2, so this will not speed things up right now).
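A rough sketch of detecting the core count instead of hardcoding it. The helper name `smp_args` is hypothetical, and reading "the tcg option" as QEMU's multi-threaded TCG (`thread=multi`) is an assumption; the commit itself may build the arguments differently.

```python
import os


def smp_args(max_cpus=None):
    """Return QEMU arguments for multi-threaded TCG with detected cores.

    Hypothetical helper. os.cpu_count() can return None, so fall back to 1.
    """
    count = os.cpu_count() or 1
    if max_cpus is not None:
        count = max(1, min(count, max_cpus))
    return ["-accel", "tcg,thread=multi", "-smp", str(count)]


detected = smp_args()
capped = smp_args(max_cpus=1)
```

On a 2-core runner this yields `-smp 2`, the same as the previously hardcoded value, which matches the remark that no speed-up is expected right now.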

Rebasing, crossing my fingers :-)

@lzap lzap force-pushed the cross-fix-1 branch 2 times, most recently from f215635 to e93a3b8 Compare December 21, 2025 19:57
@lzap
Contributor Author

lzap commented Dec 22, 2025

I only now see Fedora 42/43 on aarch64 failing to boot. Something is wrong with the boot partition; could this be related to the BTRFS changes we recently merged, @supakeen? What I see are pxe-tar-xz and qcow2 (all-customizations) failing because they are unable to find /boot/efi. But we enabled BTRFS only on the cloud images, and I can confirm that these images do indeed have ext4:

(venv) lzap@dev:~/images$ virt-filesystems -a build/fedora_44-aarch64-qcow2-import_rpm_gpg_keys_from_tree_fedora/qcow2/disk.qcow2 --all --long --uuid
Name      Type       VFS  Label MBR Size       Parent   UUID
/dev/sda1 filesystem vfat ESP   -   209489920  -        09F3-B305
/dev/sda2 filesystem ext4 boot  -   2040373248 -        2f0db0f7-13e0-4b5f-8425-455f7bb9066d
/dev/sda3 filesystem ext4 root  -   3094028288 -        c475c74d-3f0d-4b54-89e1-e2b6935aa755
/dev/sda1 partition  -    -     -   209715200  /dev/sda -
/dev/sda2 partition  -    -     -   2147483648 /dev/sda -
/dev/sda3 partition  -    -     -   3222257152 /dev/sda -
/dev/sda  device     -    -     -   5580521472 -        -

(venv) lzap@dev:~/images$ virt-filesystems -a build/fedora_44-x86_64-qcow2-import_rpm_gpg_keys_from_tree_fedora/qcow2/disk.qcow2 --all --long --uuid
Name      Type       VFS     Label MBR Size       Parent   UUID
/dev/sda1 filesystem unknown -     -   1048576    -        -
/dev/sda2 filesystem vfat    ESP   -   209489920  -        7118-63EE
/dev/sda3 filesystem ext4    boot  -   2040373248 -        491e50ab-b976-4b3e-92ee-787c9871f463
/dev/sda4 filesystem ext4    root  -   3094028288 -        a416ef13-b72b-48e4-b0c4-43cb87618b57
/dev/sda1 partition  -       -     -   1048576    /dev/sda -
/dev/sda2 partition  -       -     -   209715200  /dev/sda -
/dev/sda3 partition  -       -     -   2147483648 /dev/sda -
/dev/sda4 partition  -       -     -   3222257152 /dev/sda -
/dev/sda  device     -       -     -   5581570048 -        -

Then I also see Fedora on aarch64 running out of space on the VM image during boot; I am not sure how that is related. Several image configurations fail. I also see a lack of RNG entropy, so SSH key generation might delay the boot; I am going to add an RNG passthrough to see if that helps. But since these are Fedoras, this might again be related to the BTRFS change?
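For reference, an RNG passthrough in QEMU is usually expressed as a host-side `rng-random` backend plus a `virtio-rng-pci` device. The sketch below uses a hypothetical helper `rng_args`; the `/dev/urandom` source follows the commit message in this PR, while the object id is made up.

```python
def rng_args(source="/dev/urandom", rng_id="rng0"):
    """Return QEMU arguments that feed host randomness into the guest.

    Hypothetical helper; `rng_id` is an arbitrary identifier that links the
    backend object to the virtio device.
    """
    return [
        "-object", f"rng-random,filename={source},id={rng_id}",
        "-device", f"virtio-rng-pci,rng={rng_id}",
    ]


extra = rng_args()
```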

@supakeen
Member

> I only now see Fedora 42/43 on aarch64 failing to boot, something is wrong with the boot partition, could this be related to BTRFS changes we recently merged @supakeen ? What I see are pxe-tar-xz and qcow2 (all-customizations) failing being unable to find /boot/efi. But we enabled it only on cloud image, I can confirm that these images do indeed have ext4:
>
> (venv) lzap@dev:~/images$ virt-filesystems -a build/fedora_44-aarch64-qcow2-import_rpm_gpg_keys_from_tree_fedora/qcow2/disk.qcow2 --all --long --uuid
> Name      Type       VFS  Label MBR Size       Parent   UUID
> /dev/sda1 filesystem vfat ESP   -   209489920  -        09F3-B305
> /dev/sda2 filesystem ext4 boot  -   2040373248 -        2f0db0f7-13e0-4b5f-8425-455f7bb9066d
> /dev/sda3 filesystem ext4 root  -   3094028288 -        c475c74d-3f0d-4b54-89e1-e2b6935aa755
> /dev/sda1 partition  -    -     -   209715200  /dev/sda -
> /dev/sda2 partition  -    -     -   2147483648 /dev/sda -
> /dev/sda3 partition  -    -     -   3222257152 /dev/sda -
> /dev/sda  device     -    -     -   5580521472 -        -
>
> (venv) lzap@dev:~/images$ virt-filesystems -a build/fedora_44-x86_64-qcow2-import_rpm_gpg_keys_from_tree_fedora/qcow2/disk.qcow2 --all --long --uuid
> Name      Type       VFS     Label MBR Size       Parent   UUID
> /dev/sda1 filesystem unknown -     -   1048576    -        -
> /dev/sda2 filesystem vfat    ESP   -   209489920  -        7118-63EE
> /dev/sda3 filesystem ext4    boot  -   2040373248 -        491e50ab-b976-4b3e-92ee-787c9871f463
> /dev/sda4 filesystem ext4    root  -   3094028288 -        a416ef13-b72b-48e4-b0c4-43cb87618b57
> /dev/sda1 partition  -       -     -   1048576    /dev/sda -
> /dev/sda2 partition  -       -     -   209715200  /dev/sda -
> /dev/sda3 partition  -       -     -   2147483648 /dev/sda -
> /dev/sda4 partition  -       -     -   3222257152 /dev/sda -
> /dev/sda  device     -       -     -   5581570048 -        -
>
> Then I also see Fedora on aarch64 running out of space on the VM image during boot, not sure how that is related. Several image configurations fail, I also see lack of RNG entropy so SSH keys might delay the boot, I am going to add a RNG passthrough to see if that helps. But since these are Fedoras this might again be related to BTRFS change?

We only changed to btrfs for the Cloud images (so: cloud-azure, cloud-ec2, cloud-gce, cloud-qcow2) on all architectures except s390x; not the pxe-tar-xz image type.

Looking at the latest test run I do not see any disk space errors (I think) on aarch64; I mostly see ConnectionRefusedError: cannot connect to port 47555 after 600s there and none of it on any cloud-* images so I don't think that's related here.

Can you link me to a run where an image runs out of disk space during boot?

@lzap
Contributor Author

lzap commented Dec 23, 2025

For example: https://gitlab.com/redhat/services/products/image-builder/ci/images/-/jobs/12511371522

systemd[1]: Failed to populate /etc with preset unit settings, ignoring: Read-only file system

I need to take a closer look; that might not actually be a full disk.

Edit: Nah, this was not it. I saw other failures with many services failing because the disk was full, but I cannot dig them up anymore.

@lzap
Contributor Author

lzap commented Jan 1, 2026

So I see a few types of failures. First, KVM fails to load, probably because we run this in AWS, which does not support nested KVM. I do not understand why this worked before?!

Testing image at build/fedora_43-x86_64-pxe_tar_xz-jq_only/xz/pxe.tar.xz
QEMU: qemu-system-x86_64 -M q35,accel=kvm -cpu Haswell-v4 -device virtio-scsi-pci,id=scsi -device scsi-hd,drive=disk0 -m 3000 -serial stdio -monitor none -device virtio-net-pci,netdev=net.0,id=net.0 -netdev user,id=net.0,hostfwd=tcp::45961-:22 -qmp unix:/tmp/vmtest-stc7pfa8-disk.img/qmp.socket,server,nowait -drive file=/var/tmp/tmp1dqkvbhu/disk.img,if=none,id=disk0,cache=unsafe,format=raw -nographic -kernel /var/tmp/tmp1dqkvbhu/vmlinuz -initrd /var/tmp/tmp1dqkvbhu/combined.img -append rd.live.image root=live:/rootfs.img console=ttyS0 systemd.debug-shell=ttyS0 systemd.mask=serial-getty@ttyS0.service systemd.unit=reboot.target
qemu-system-x86_64: Could not access KVM kernel module: No such file or directory
qemu-system-x86_64: failed to initialize kvm: No such file or directory

@ondrejbudai left a comment on Slack that these should run on OpenStack and not AWS, unsure how to achieve that tho.
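As an aside, the "Could not access KVM kernel module" failure can be detected before launching QEMU by checking whether `/dev/kvm` is usable. The sketch below, with a hypothetical helper `machine_args`, falls back to TCG emulation in that case. This is only an illustration of the check and not what this PR does; the PR moves the job to a runner where KVM is available.

```python
import os


def machine_args(kvm_device="/dev/kvm"):
    """Pick the accelerator based on whether the KVM device is accessible.

    Hypothetical helper: returns accel=kvm when the device node can be
    opened for reading and writing, otherwise accel=tcg (slow emulation).
    """
    usable = os.access(kvm_device, os.R_OK | os.W_OK)
    accel = "kvm" if usable else "tcg"
    return ["-M", f"q35,accel={accel}"]


# A path that does not exist behaves like a host without KVM.
fallback = machine_args("/nonexistent/kvm")
```

Failing fast with a clear error instead of falling back would be an equally reasonable choice for CI, since silent emulation can push jobs into the timeout seen above.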

Then there is the UUID problem:

[  267.158666] systemd[1]: dev-disk-by\x2duuid-7828\x2d141D.device: Job dev-disk-by\x2duuid-7828\x2d141D.device/start timed out.
[ TIME ] Timed out waiting for device dev-disk-by\x2duuid-7828\x2d141D.device - /dev/disk/by-uuid/7828-141D.

Finally, aarch64 emulation is slow and on some instances hits the 600-second limit; I am increasing it a tad to see if it helps.

@lzap
Contributor Author

lzap commented Jan 2, 2026

Found out why pxe-tar-xz is being executed on AWS, fixed it. Not sure about the other failures tho.

@lzap
Contributor Author

lzap commented Jan 3, 2026

Nice, now only the Fedora images on aarch64 fail to boot; they share a common problem with storage. Check out those three failures, @supakeen, thanks.

@supakeen
Member

supakeen commented Jan 5, 2026

> Nice, only Fedoras on aarch64 fails to boot, it has a common problem with storage. Check those three failures out @supakeen thanks.

Thank you. It seems to me that these are just really slow. Are we running these emulated on x86 hosts? If so we might just want to bump the timeout quite aggressively. Say go to 1800 instead of 800?

Alternatively; we can temporarily disable the aarch64 boot-tests in GitLab until we find some aarch64 runners? This might be the more practical answer here for now as we're already testing at least (some of) the image types on clouds that have aarch64.

I'm saying this because I see even in initramfs it takes almost 90 seconds to find the root filesystem, and that is only bound to get slower as more stuff runs after switch root; which then times out, it retries, etc.

@lzap
Contributor Author

lzap commented Jan 5, 2026

Good idea, squashed in 1800, which is three times the original 600. If this does not work, well, we might need to disable this. Yeah, this is full emulation; I actually tried to solve this by adding some optimization (multi-threading), but it did not help much.
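For context, the timeout being raised bounds a wait loop of roughly the following shape. This is a simplified sketch with a hypothetical function `wait_for_port`; the real helper in vmtest may check more than a bare TCP connect. The error text imitates the ConnectionRefusedError message quoted earlier in this conversation.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", timeout=1800.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return
        except OSError:
            if time.monotonic() >= deadline:
                raise ConnectionRefusedError(
                    f"cannot connect to port {port} after {timeout:g}s"
                )
            time.sleep(interval)
```

A larger timeout only helps if the guest eventually boots; when the boot is stuck, each failing job simply runs longer before it fails.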

achilleas-k
achilleas-k previously approved these changes Jan 5, 2026
Member

@achilleas-k achilleas-k left a comment

Nice work. Thank you!!

supakeen
supakeen previously approved these changes Jan 5, 2026
Member

@supakeen supakeen left a comment

I'll approve but I think we still have paths where we skip images (for example CannotRunQemuTest when jq is not found?).

@croissanne
Member

> I'll approve but I think we still have paths where we skip images (for example CannotRunQemuTest when jq is not found?).

ye, trying to do a more extensive rework here #2107

@supakeen supakeen requested a review from achilleas-k January 5, 2026 20:55
@achilleas-k
Member

Sanne's last commit brings back the issue that was resolved (as a workaround) with #2075. We'll keep doing rebuilds if we merge this as it is now.

@achilleas-k
Member

achilleas-k commented Jan 6, 2026

Here's an example of the problem:

Fedora 42, x86_64, cloud-qcow2 doesn't have boot-success: true, but it is in the list of image types that can_boot_test().
https://gitlab.com/redhat/services/products/image-builder/ci/images/-/jobs/12607089151#L2078

🖼️ Manifest fedora_42-x86_64-cloud_qcow2-import_rpm_gpg_keys_from_tree_fedora.json was successfully built in commit 92493888e0b6157b987f4998b46ba3aa2941e25b
  https://github.com/osbuild/images/commit/92493888e0b6157b987f4998b46ba3aa2941e25b
  PR-2106: https://github.com/osbuild/images/pull/2106
  Boot test success not found.

So the job creates a dynamic build step for it.

Then, when it's built, we get
https://gitlab.com/redhat/services/products/image-builder/ci/images/-/jobs/12607399243#L2692

✅ Build finished!!
Testing image at build/fedora_42-x86_64-cloud_qcow2-import_rpm_gpg_keys_from_tree_fedora/qcow2/disk.qcow2
WARNING: skipping build/fedora_42-x86_64-cloud_qcow2-import_rpm_gpg_keys_from_tree_fedora/qcow2/disk.qcow2: no jq package in image

Meaning, if you rerun the pipeline (or on the next PR), it will get built again. Every time.

This configuration (distro, architecture, image type, blueprint) will always look like it can be boot tested when running the build pipeline generator, but it never will be boot tested in the actual build step.
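The loop described above can be modelled in a few lines. This is a toy simulation only: apart from `can_boot_test`, every name (the `Config` fields, the set of testable types, `simulate`) is hypothetical and does not reflect the real generator or `boot-image` code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    image_type: str
    has_jq: bool  # hypothetical flag: does the blueprint install jq?


# Hypothetical list used by the pipeline generator.
BOOT_TESTABLE_TYPES = {"qcow2", "cloud-qcow2"}


def can_boot_test(cfg):
    """Generator-side check: looks only at the image type."""
    return cfg.image_type in BOOT_TESTABLE_TYPES


def boot_step_runs_test(cfg):
    """Build-step check: additionally skips images without jq."""
    return can_boot_test(cfg) and cfg.has_jq


def simulate(cfg, pipeline_runs):
    """Count how often the configuration is rebuilt across pipeline runs."""
    boot_success = False
    rebuilds = 0
    for _ in range(pipeline_runs):
        if can_boot_test(cfg) and not boot_success:
            rebuilds += 1
            boot_success = boot_step_runs_test(cfg)
    return rebuilds


without_jq = simulate(Config("cloud-qcow2", has_jq=False), pipeline_runs=5)
with_jq = simulate(Config("cloud-qcow2", has_jq=True), pipeline_runs=5)
```

In this model the configuration without jq is rebuilt on every run, because the two checks disagree; making both sides use the same predicate removes the loop.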

@achilleas-k
Member

achilleas-k commented Jan 6, 2026

#2107 fixes the problem.

https://gitlab.com/redhat/services/products/image-builder/ci/images/-/jobs/12608297275#L2126

🖼️ Manifest fedora_42-x86_64-cloud_qcow2-import_rpm_gpg_keys_from_tree_fedora.json was successfully built in commit 8c0fc8243234f753c606a6ccc1f914c082c56cee
  https://github.com/osbuild/images/commit/8c0fc8243234f753c606a6ccc1f914c082c56cee
  PR-2103: https://github.com/osbuild/images/pull/2103
not bootable: jq not found in fedora_42-x86_64-cloud_qcow2-import_rpm_gpg_keys_from_tree_fedora.json (x86_64 cloud-qcow2)
  Boot testing for cloud-qcow2 is not yet supported

lzap added 12 commits January 6, 2026 10:38

Cross arch testing does not work properly, the QEMU command line renders
image twice. Once as device option and once as an argument. On top of
that, arguments are incorrectly formatted since Path object is getting
printed:

PosixPath('/var/tmp/tmpdk5gg6b6/netinst.iso'), PosixPath('/var/tmp/tmpp40bxujp/disk.img')

QEMU complains about raw image that cannot be auto-detected because it
is empty. The problem is, for installer ISO we create an empty raw image
and since there is zero content it cannot be autodetected. But the
format= argument is passed as qcow2 which is not correct in this case.

Passing cpu=host creates inconsistent environment when running on
various laptops locally or even in AWS when running on different
instance types. It is better to be explicit to always use the same CPU
feature set.

This patch uses Skylake family which is x86_64-v3 which is required by
RHEL10. It also adds few extra features for better performance. Finally,
Q35 is used instead of the legacy platform.

CDROM image was appended with Path("xxx") prefix, this fixes it and also
adds a nice debug output which is pretty useful when trying to launch an
image locally with the same options.

Adds cache=unsafe to the main QEMU drive. This speeds up writes at the
cost of decreased reliability on power outage. This is not a concern in
CICD pipeline so there are no drawbacks.

Enable multi-threaded emulation for ARM. I tried to set the CPU to "max" but
it didn't work so let's use the default Cortex-A57 CPU.

The main disk image device is passed via -drive and then once again as
an argument to qemu. This is redundant and causes issues with qemu being
unable to detect the image format.

We hardcode the number of cores to use in the QEMU command line. This is not
ideal, as it makes it harder to test with different numbers of cores.

Instead, we should use the `-smp` argument to QEMU to specify the number of
cores to use. This is more flexible and allows us to test with different
numbers of cores.

Explicitly terminate the HTTP server on exit to avoid blocking on wait().
The with method does not guarantee that the process will be terminated.

I noticed a warning during boot: The VM might be stalling while waiting
for enough entropy to generate SSH host keys. Let's pass /dev/urandom to
the VM to ensure there is always enough entropy. This is not a security
issue for testing VMs as they are discarded immediately.

On some instances when emulating aarch64, the system was not fully
booted after 600 seconds, let's give it slightly more time.

This needs to be executed on OpenStack.

`boot-image` will now raise an exception if the image is bootable, but
no suitable implementation was found, as this is a programmer error.

Also, `boot-image` will first use the same logic as the build pipeline
generator to check if an image type can be boot tested.  This will later
be expanded to include the more detailed boot-test skipping logic.

@croissanne
Member

> #2107 fixes the problem.
>
> https://gitlab.com/redhat/services/products/image-builder/ci/images/-/jobs/12608297275#L2126
>
> 🖼️ Manifest fedora_42-x86_64-cloud_qcow2-import_rpm_gpg_keys_from_tree_fedora.json was successfully built in commit 8c0fc8243234f753c606a6ccc1f914c082c56cee
>   https://github.com/osbuild/images/commit/8c0fc8243234f753c606a6ccc1f914c082c56cee
>   PR-2103: https://github.com/osbuild/images/pull/2103
> not bootable: jq not found in fedora_42-x86_64-cloud_qcow2-import_rpm_gpg_keys_from_tree_fedora.json (x86_64 cloud-qcow2)
>   Boot testing for cloud-qcow2 is not yet supported

Let's get this in, and then I'll rebase #2107.

@supakeen supakeen added this pull request to the merge queue Jan 6, 2026
Merged via the queue into osbuild:main with commit 09a3c49 Jan 6, 2026
25 checks passed
@lzap lzap deleted the cross-fix-1 branch January 7, 2026 08:44
4 participants