ci: Freeze kernel at 5.6.7 due to loop regression breaking blackbox tests #976
Conversation
It seems like there's a regression in the 5.6.8 kernel causing our
blackbox tests to fail with e.g.:
```
blackbox_test.go:114: failed: "mkfs.vfat": exit status 1
mkfs.vfat: unable to open /dev/loop0p1: No such file or directory
mkfs.fat 4.1 (2017-01-24)
```
And looking at dmesg, one can see the partition rescan is failing with
`-EBUSY`:
```
__loop_clr_fd: partition scan of loop3 failed (rc=-16)
loop_reread_partitions: partition scan of loop0 (/var/tmp/ignition-blackbox-570148150/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop1 (/var/tmp/ignition-blackbox-134124829/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop2 (/var/tmp/ignition-blackbox-492917208/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop3 (/var/tmp/ignition-blackbox-966528855/hd0) failed (rc=-16)
```
Looking at the 5.6.8 notes, the only commit that jumps out is
https://lkml.org/lkml/2019/5/6/1059, though it seems focused on loop
devices backed by block devices.
The only other report I found of this is:
https://bugs.archlinux.org/task/66526
Anyway, I don't think this is a serious enough regression to hold the
kernel in FCOS. But I really want blackbox tests to work in CoreOS CI
where it's easy for everyone to inspect results, download, retry, etc.
So let's override the kernel for now.
According to the Arch Linux bug, it seems like it's partially fixed in
5.7 (though I haven't tried it), so we should be able to unfreeze it
then (or if we want, fast-track it once there's a build for f32).
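For reference, freezing a package in fedora-coreos-config is done with a lockfile override entry. A rough sketch of what the override might look like — the exact filename, key names (`evr` vs. per-arch `evra`), and release string are illustrative, so check the repo's conventions before copying:

```yaml
# manifest-lock.overrides.yaml (illustrative; actual file may be per-arch)
packages:
  kernel:
    evr: 5.6.7-200.fc32
```

Removing the entry later is what "unfreezing" amounts to.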
arithx left a comment:
This is awesome! Thanks for tracking it down.
Looks like CI had an unrelated networking issue.
Hmm weird, it's still hitting:
But I can't reproduce this locally after coreos/coreos-assembler#1432. Looking.
Force-pushed from 5d48013 to 7edf7a3.
Wow, that took a while to figure out. So, here we're using the cosa buildroot image. I thought this was fine, because we automatically rebuild it whenever a cosa image is built. And the latest cosa buildroot image has: […] So one would think that it has that commit. Yet, adding […]
Which is coreos/coreos-assembler@5a07e8a, which is the parent commit of coreos/coreos-assembler@4e60560. I thought maybe the CentOS CI cluster had downloaded a stale version of the image when running the tests here. Yet, doing an […] which matched the image ID of the latest buildroot image. So it definitely was pulling the latest. What was actually happening was that somehow the […] Which meant that the […] I re-added those lines and restarted another […]
Oh man, that's just evil.
Wow 😦 Agreed. Great debugging work @jlebon!
I had done this temporarily in the `.cci.jenkinsfile` of Ignition in coreos/ignition#976 to help debugging and I found it really useful. Let's just always write it out. But don't error out if it somehow doesn't exist.