scripts/boot-image: fix cross-arch smoke tests #2103
ebf115e to 206c8a8
Look at that, it shows that something is indeed a PosixPath. FWIW I vaguely recall that
Yeah, found it. It was not only cdrom but also memory (an integer) and possibly extra arguments, which can be freely passed. Squashed it into the "cdrom" commit. Also found out that the CPU model I set was too modern and some hosts did not have it, so I am trying the "oldest" possible one which has Intel-v3 features: Finally, I was missing the most important fix in the last rebase: removing the Let's see how much green I can get this time :-)
Seems a lot greener already!
Still a couple of problems. PXE-TAR on Intel times out, which is weird because there is no output. When I try this locally (with RHEL10) I get console output just fine and everything works, but then the reset event is somehow not caught and it gets stuck forever: When I cancel I do see a problem, possibly with the HTTP server, and maybe that is the reason why the PXE-TAR tests hang: I remember it was me who refactored the HTTP server from try/catch to a `with` block. If I put that back, it works locally. So I am adding another commit that restores the old behavior with explicit termination to prevent the HTTP process from hanging. It calls And then many aarch64 tests fail because the "max" CPU type is problematic. I am putting back the Cortex type, which is safer, but I still want to try the tcg option which is supposed to speed up emulation on multi-core instances. Actually, instead of a hardcoded 2, I am adding code that detects the number of cores, since we run these boot tests on EC2 and in the future we might have more than 2 cores (currently we only have 2, so this will not speed things up right now). Rebasing, crossing my fingers :-)
f215635 to e93a3b8
I only now see Fedora 42/43 on aarch64 failing to boot. Something is wrong with the boot partition; could this be related to the BTRFS changes we recently merged, @supakeen? What I see are pxe-tar-xz and qcow2 (all-customizations) failing because they are unable to find /boot/efi. But we enabled it only on the cloud image, and I can confirm that these images do indeed have ext4: Then I also see Fedora on aarch64 running out of space on the VM image during boot, not sure how that is related. Several image configurations fail. I also see a lack of RNG entropy, so SSH key generation might delay the boot; I am going to add an RNG passthrough to see if that helps. But since these are Fedoras, this might again be related to the BTRFS change?
We only changed to btrfs for the Cloud images (so: Looking at the latest test run I do not see any disk space errors (I think) on aarch64; I mostly see Can you link me to a run where an image runs out of disk space during boot?
For example: https://gitlab.com/redhat/services/products/image-builder/ci/images/-/jobs/12511371522 I need to take a closer look; that might not actually be a full disk. Edit: Nah, this was not it. I saw other failures with many services failing because the disk was full, but I cannot dig them up anymore.
So I see a few types of failures. First, KVM fails to load, probably because we already run this in AWS, which does not support nested KVM. I do not understand why this worked before?! @ondrejbudai left a comment on Slack that these should run on OpenStack and not AWS, unsure how to achieve that tho. Then there is the UUID problem: Finally, aarch64 emulation is slow on some instances and hits the 600-second limit, so I am increasing it a tad to see if it helps.
Found out why
Nice, only Fedoras on aarch64 fail to boot; they have a common problem with storage. Check those three failures out @supakeen, thanks.
Thank you. It seems to me that these are just really slow. Are we running these emulated on x86 hosts? If so we might just want to bump the timeout quite aggressively. Say, go to 1800 instead of 800? Alternatively, we can temporarily disable the aarch64 boot-tests in GitLab until we find some aarch64 runners? This might be the more practical answer here for now as we're already testing at least (some of) the image types on clouds that have aarch64. I'm saying this because I see that even in initramfs it takes almost 90 seconds to find the root filesystem, and that is only bound to get slower as more stuff runs after switch root, which then times out, retries, etc.
Good idea, squashed 1800, which is 3 times more. If this does not work, well, we might need to disable this. Yeah, this is full emulation. I actually tried to solve this by adding some optimization (multi-threading), but it did not help too much.
supakeen left a comment
I'll approve but I think we still have paths where we skip images (for example CannotRunQemuTest when jq is not found?).
Yeah, trying to do a more extensive rework here: #2107
Sanne's last commit brings back the issue that was resolved (as a workaround) with #2075. We'll keep doing rebuilds if we merge this as it is now.
Here's an example of the problem: Fedora 42, x86_64, cloud-qcow2 doesn't have So the job creates a dynamic build step for it. Then, when it's built, we get Meaning, if you rerun the pipeline (or on the next PR), it will get built again. Every time. This configuration (distro, architecture, image type, blueprint) will always look like it can be boot tested when running the build pipeline generator, but it never will be boot tested in the actual build step.
#2107 fixes the problem. https://gitlab.com/redhat/services/products/image-builder/ci/images/-/jobs/12608297275#L2126
Cross arch testing does not work properly: the QEMU command line renders
the image twice, once as a device option and once as an argument. On top
of that, arguments are incorrectly formatted since the Path object is
getting printed:
PosixPath('/var/tmp/tmpdk5gg6b6/netinst.iso'), PosixPath('/var/tmp/tmpp40bxujp/disk.img')
QEMU complains about a raw image whose format cannot be auto-detected
because it is empty. For the installer ISO we create an empty raw image,
and since it has zero content there is nothing to detect. The format=
argument is still passed as qcow2, which is not correct in this case.
Passing cpu=host creates an inconsistent environment when running locally on various laptops, or even in AWS when running on different instance types. It is better to be explicit and always use the same CPU feature set. This patch uses the Skylake family, which provides the x86_64-v3 feature level required by RHEL10. It also adds a few extra features for better performance. Finally, Q35 is used instead of the legacy platform.
The CDROM image was rendered with a Path("xxx") prefix. This fixes it and
also adds debug output, which is useful when trying to launch an image
locally with the same options.
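A minimal sketch of the idea, not the actual `boot-image` code: every argument (a `Path`, an integer memory size, and so on) is converted with `str()` before it reaches the command line, and the same list is rendered as a copy-pasteable debug line. The helper name `render_command` is hypothetical.

```python
import shlex
from pathlib import Path


def render_command(args):
    """Return (argv, debug_line) with every argument converted to str.

    Hypothetical helper for illustration only. Converting explicitly avoids
    repr() output such as PosixPath('/var/tmp/netinst.iso') ending up in the
    rendered command line.
    """
    argv = [str(arg) for arg in args]
    # shlex.join quotes arguments where needed, so the line can be pasted
    # into a shell to reproduce the run locally.
    return argv, shlex.join(argv)


argv, debug_line = render_command(
    ["qemu-system-x86_64", "-m", 2048, "-cdrom", Path("/var/tmp/netinst.iso")]
)
print(debug_line)
```

Printing the joined line before launching QEMU is what makes a failed CI run reproducible on a laptop.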
Adds cache=unsafe to the main QEMU drive. This speeds up writes at the cost of decreased reliability on power outage. This is not a concern in a CI/CD pipeline, so there are no drawbacks.
Enable multi-threaded emulation for ARM. I tried to set the CPU to "max" but it didn't work so let's use the default Cortex-A57 CPU.
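A rough sketch of how the explicit per-architecture CPU choice could look. The function name `cpu_args` is hypothetical, and `Skylake-Client` is only an assumption standing in for "the Skylake family" mentioned in the earlier commit; the exact model string and the extra feature flags used in the patch are not shown here.

```python
def cpu_args(arch: str) -> list[str]:
    """Return explicit machine/CPU arguments instead of relying on -cpu host.

    Hypothetical helper; the model names are assumptions for illustration.
    """
    if arch == "x86_64":
        # Q35 instead of the legacy platform, plus a fixed CPU model so every
        # host exposes the same feature set to the guest.
        return ["-machine", "q35", "-cpu", "Skylake-Client"]
    if arch == "aarch64":
        # Cortex-A57 rather than "max", with multi-threaded TCG emulation.
        return ["-machine", "virt", "-cpu", "cortex-a57", "-accel", "tcg,thread=multi"]
    raise ValueError(f"unsupported architecture: {arch}")
```

Keeping the choice in one place makes it easy to swap the model if some host turns out not to support it.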
The main disk image device is passed via -drive and then once again as an argument to qemu. This is redundant and causes issues with qemu being unable to detect the image format.
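For illustration, a sketch of passing the disk exactly once, with an explicit format so QEMU never has to probe an empty image, and with the `cache=unsafe` setting from the earlier commit. The helper name `drive_args` is hypothetical and this is not the code from the patch.

```python
from pathlib import Path


def drive_args(image: Path, image_format: str) -> list[str]:
    """Return the -drive arguments for the main disk.

    Hypothetical helper. The caller passes "raw" for the empty installer
    target disk and "qcow2" for qcow2 images; the image path is NOT appended
    again as a positional argument.
    """
    return ["-drive", f"file={image},format={image_format},cache=unsafe"]


print(drive_args(Path("/var/tmp/disk.img"), "raw"))
```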
We hardcode the number of cores to use in the QEMU command line. This is not ideal, as it makes it harder to test with different numbers of cores. Instead, we should use the `-smp` argument to QEMU to specify the number of cores to use. This is more flexible and allows us to test with different numbers of cores.
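One way this could be done, as a sketch only: detect the host core count with `os.cpu_count()` and feed it to `-smp`, with an optional override for testing different values. The helper name `smp_args` is hypothetical.

```python
import os
from typing import Optional


def smp_args(cores: Optional[int] = None) -> list[str]:
    """Return the -smp argument, defaulting to the number of host cores.

    Hypothetical helper. os.cpu_count() can return None, so fall back to 1.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    return ["-smp", str(max(1, cores))]


print(smp_args())
```

On the current 2-core instances this yields the same value as the old hardcoded one, so it only pays off once larger runners are available.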
Explicitly terminate the HTTP server on exit to avoid blocking on wait(). The `with` statement does not guarantee that the process will be terminated.
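A sketch of the explicit-termination pattern, under the assumption that the server is a child process started with `python -m http.server` (the real script may launch it differently). `subprocess.Popen` used directly as a context manager only waits for the child on exit, so a long-running server blocks forever; the `finally` block here terminates it first. The name `http_server` is hypothetical.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def http_server(directory: str, port: int = 8000):
    """Serve `directory` over HTTP and always stop the server on exit."""
    proc = subprocess.Popen(
        [sys.executable, "-m", "http.server", str(port), "--directory", directory],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        yield proc
    finally:
        # Ask the server to stop, then make sure it is really gone.
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
```

Usage would be `with http_server("/path/to/artifacts") as proc: ...`, with the boot test running inside the block.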
I noticed a warning during boot: The VM might be stalling while waiting for enough entropy to generate SSH host keys. Let's pass /dev/urandom to the VM to ensure there is always enough entropy. This is not a security issue for testing VMs as they are discarded immediately.
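As a sketch, the passthrough can be expressed with QEMU's `rng-random` backend and a `virtio-rng-pci` device. The helper name `rng_args` and the `rng0` id are hypothetical, and the patch may spell the options differently.

```python
def rng_args(source: str = "/dev/urandom") -> list[str]:
    """Return QEMU arguments that feed host entropy to the guest.

    Hypothetical helper for illustration.
    """
    return [
        "-object", f"rng-random,id=rng0,filename={source}",
        "-device", "virtio-rng-pci,rng=rng0",
    ]


print(" ".join(rng_args()))
```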
On some instances when emulating aarch64, the system was not fully booted after 600 seconds, let's give it slightly more time.
This needs to be executed on OpenStack.
`boot-image` will now raise an exception if the image is bootable, but no suitable implementation was found, as this is a programmer error. Also, `boot-image` will first use the same logic as the build pipeline generator to check if an image type can be boot tested. This will later be expanded to include the more detailed boot-test skipping logic.
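To illustrate the intended behavior, here is a small sketch with entirely hypothetical names and data (`BOOT_TESTABLE`, `BOOT_IMPLEMENTATIONS`, `boot_image`): the same predicate decides whether a type is boot-testable in both the generator and the boot step, and a testable type without an implementation raises instead of being silently skipped.

```python
# Hypothetical data: image types the pipeline generator treats as boot-testable.
BOOT_TESTABLE = {"qcow2", "pxe-tar-xz", "hypothetical-type"}


def boot_qemu(image_type: str) -> str:
    """Stand-in for a real boot test."""
    return f"booted {image_type} in QEMU"


# Hypothetical dispatch table; "hypothetical-type" is deliberately missing.
BOOT_IMPLEMENTATIONS = {"qcow2": boot_qemu, "pxe-tar-xz": boot_qemu}


def can_be_boot_tested(image_type: str) -> bool:
    """Shared check, meant to be used by the generator and by boot-image."""
    return image_type in BOOT_TESTABLE


def boot_image(image_type: str):
    if not can_be_boot_tested(image_type):
        return None  # legitimately skipped, the generator agrees
    implementation = BOOT_IMPLEMENTATIONS.get(image_type)
    if implementation is None:
        # Programmer error: the generator would schedule a rebuild every time.
        raise RuntimeError(f"no boot implementation for {image_type!r}")
    return implementation(image_type)
```

Raising here turns the "always rebuilt, never boot tested" case into a visible failure rather than a silent skip.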
80977e2 to 461065d
Let's get this in, and then I'll rebase #2107
Cross arch testing does not work properly: the QEMU command line renders the image twice, once as a device option and once as an argument. On top of that, arguments are incorrectly formatted since the Path object is getting printed:
PosixPath('/var/tmp/tmpdk5gg6b6/netinst.iso'), PosixPath('/var/tmp/tmpp40bxujp/disk.img')
QEMU complains about a raw image whose format cannot be auto-detected because it is empty. For the installer ISO we create an empty raw image, and since it has zero content there is nothing to detect. The format= argument is still passed as qcow2, which is not correct in this case.
As part of this PR, I have also modernized the CPU a little bit, which was required in order to use virtio. On my system with a recent Intel CPU, I was getting some panics, so I had to enable the Q35 model, and I also must not pass `-cpu host`, which passed some very modern features that did not work well. Instead, I selected a relatively modern model that has the x86_64-v3 feature set required by RHEL10, which makes it more stable across the various laptops/servers we locally test on.
Also some other smaller things like `cache=unsafe` or `-cpu max` for ARM, see the individual commits.
I am pulling the terraform update which should bring new instances; hopefully this can make things a bit faster at times.