ci: Freeze kernel at 5.6.7 due to loop regression breaking blackbox test by jlebon · Pull Request #976 · coreos/ignition

jlebon · 2020-05-07T16:14:29Z

It seems like there's a regression in the 5.6.8 kernel causing our
blackbox tests to fail with e.g.:

blackbox_test.go:114: failed: "mkfs.vfat": exit status 1
    mkfs.vfat: unable to open /dev/loop0p1: No such file or directory
    mkfs.fat 4.1 (2017-01-24)

And looking at dmesg, one can see the partition rescan is failing with
-EBUSY:

__loop_clr_fd: partition scan of loop3 failed (rc=-16)
loop_reread_partitions: partition scan of loop0 (/var/tmp/ignition-blackbox-570148150/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop1 (/var/tmp/ignition-blackbox-134124829/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop2 (/var/tmp/ignition-blackbox-492917208/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop3 (/var/tmp/ignition-blackbox-966528855/hd0) failed (rc=-16)

Looking at the 5.6.8 notes, the only commit that jumps out is
https://lkml.org/lkml/2019/5/6/1059, though it seems focused on loop
devices backed by block devices.

The only other report I found of this is:
https://bugs.archlinux.org/task/66526

Anyway, I don't think this is a serious enough regression to hold the
kernel in FCOS. But I really want blackbox tests to work in CoreOS CI
where it's easy for everyone to inspect results, download, retry, etc..
So let's override the kernel for now.

According to the Arch Linux bug, it seems like it's partially fixed in
5.7 (though I haven't tried it), so we should be able to unfreeze it
then (or if we want, fast-track it once there's a build for f32).

It seems like there's a regression in the 5.6.8 kernel causing our blackbox tests to fail with e.g.: ``` blackbox_test.go:114: failed: "mkfs.vfat": exit status 1 mkfs.vfat: unable to open /dev/loop0p1: No such file or directory mkfs.fat 4.1 (2017-01-24) ``` And looking at dmesg, one can see the partition rescan is failing with `-EBUSY`: ``` __loop_clr_fd: partition scan of loop3 failed (rc=-16) loop_reread_partitions: partition scan of loop0 (/var/tmp/ignition-blackbox-570148150/hd0) failed (rc=-16) loop_reread_partitions: partition scan of loop1 (/var/tmp/ignition-blackbox-134124829/hd0) failed (rc=-16) loop_reread_partitions: partition scan of loop2 (/var/tmp/ignition-blackbox-492917208/hd0) failed (rc=-16) loop_reread_partitions: partition scan of loop3 (/var/tmp/ignition-blackbox-966528855/hd0) failed (rc=-16) ``` Looking at the 5.6.8 notes, the only commit that jumps out is https://lkml.org/lkml/2019/5/6/1059, though it seems focused on loop devices backed by block devices. The only other report I found of this is: https://bugs.archlinux.org/task/66526 Anyway, I don't think this is a serious enough regression to hold the kernel in FCOS. But I really want blackbox tests to work in CoreOS CI where it's easy for everyone to inspect results, download, retry, etc.. So let's override the kernel for now. According to the Arch Linux bug, it seems like it's partially fixed in 5.7 (though I haven't tried it), so we should be able to unfreeze it then (or if we want, fast-track it once there's a build for f32).

arithx

This is awesome! Thanks for tracking it down.

Looks like CI had an unrelated networking issue.

jlebon · 2020-05-07T16:28:31Z

coreos/coreos-assembler#1431

jlebon · 2020-05-07T19:18:44Z

Hmm weird, it's still hitting:

Downloading from 'fedora-coreos-pool'...done
error: Cannot download Packages/k/kernel-core-5.6.7-200.fc31.x86_64.rpm: All mirrors were tried; Last error: Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/repos-dist/coreos-pool/latest/x86_64/Packages/k/kernel-5.6.7-200.fc31.x86_64.rpm [Could not resolve host: kojipkgs.fedoraproject.org]

But I can't reproduce this locally after coreos/coreos-assembler#1432. Looking.

jlebon · 2020-05-07T21:14:27Z

Wow, that took a while to figure out. So, here we're using the cosa buildroot image. I thought this was fine though, because we automatically rebuild it whenever a cosa image is built. And the latest cosa buildroot image has:

io.openshift.build.commit.author=Jonathan Lebon \u003cjonathan@jlebon.com\u003e 
io.openshift.build.commit.date=Thu May 7 12:51:02 2020 -0400 
io.openshift.build.commit.id=4e6056029a258d3cb08bacffa1e4014e0daa0294 
io.openshift.build.commit.message=cmdlib: Lower cost of cosa RPM overrides repo 
io.openshift.build.commit.ref=master 
io.openshift.build.name=cosa-buildroot-266 
io.openshift.build.namespace=coreos 
io.openshift.build.source-location=https://github.com/coreos/coreos-assembler

So one would think that it has that commit. Yet, adding cat /cosa/coreos-assembler-git.json to CI here shows:

"commit": "5a07e8a5aef1bdca2272e22cbd9aaed142819f8b",

Which is coreos/coreos-assembler@5a07e8a, which is the parent commit of coreos/coreos-assembler@4e60560.

I thought maybe the CentOS CI cluster downloaded a stale version of the image when running the tests here. Yet, doing an oc describe pod pod-ba8ac29e-4eef-4359-bf58-d2a500eaf5f8-p4f42-2c42l shows:

Containers:
  worker:
    Container ID:       docker://c7b0eb81a499932eb102ced6f83da41001e326d0ab49a61e8c2220b70eddc92f
    Image:              registry.svc.ci.openshift.org/coreos/cosa-buildroot:latest
    Image ID:           docker-pullable://registry.svc.ci.openshift.org/coreos/cosa-buildroot@sha256:fc947ef06299984f290a58515cb2d9b5bc8d5e28e3662c62da2a9685627b89ca

which matched the image ID of the latest buildroot image. So it definitely was pulling the latest.

What was actually happening was that somehow the cosa-buildroot buildconfig in the cluster had lost this bit:

https://github.com/openshift/release/blob/09eaeb175f82b8e71838a39c45f17dc199505852/services/coreos/cosa-buildroot.yaml#L28-L30

Which meant that the Dockerfile.buildroot's FROM quay.io/coreos-assembler/coreos-assembler:latest wasn't being replaced by the imagestream which triggered us in the first place. And what happened then is that we don't have a pull policy to always pull from upstream, so we were using a cached version of the image as the FROM layer.

I re-added those lines and restarted another cosa-buildroot build, then restarted CI here. I'm still not sure how those lines went missing though (but it's not the first time either that things are somehow out of sync).

cgwalters · 2020-05-08T12:38:57Z

Which meant that the Dockerfile.buildroot's FROM quay.io/coreos-assembler/coreos-assembler:latest wasn't being replaced by the imagestream which triggered us in the first place. And what happened then is that we don't have a pull policy to always pull from upstream, so we were using a cached version of the image as the FROM layer.

Oh man, that's just evil.

ashcrow · 2020-05-08T13:14:10Z

Which meant that the Dockerfile.buildroot's FROM quay.io/coreos-assembler/coreos-assembler:latest wasn't being replaced by the imagestream which triggered us in the first place. And what happened then is that we don't have a pull policy to always pull from upstream, so we were using a cached version of the image as the FROM layer.

Oh man, that's just evil.

Wow 😦 Agreed.

Great debugging work @jlebon!

I had done this temporarily in the `.cci.jenkinsfile` of Ignition in coreos/ignition#976 to help debugging and I found it really useful. Let's just always write it out.

I had done this temporarily in the `.cci.jenkinsfile` of Ignition in coreos/ignition#976 to help debugging and I found it really useful. Let's just always write it out. But don't error out if it somehow doesn't exist.

arithx approved these changes May 7, 2020

View reviewed changes

jlebon force-pushed the pr/override-ci-kernel branch from 5d48013 to 7edf7a3 Compare May 7, 2020 20:48

cgwalters merged commit ffc74f4 into coreos:master May 8, 2020

jlebon mentioned this pull request May 8, 2020

cosaPod: write out details about cosa version coreos/coreos-ci-lib#27

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: Freeze kernel at 5.6.7 due to loop regression breaking blackbox test#976

ci: Freeze kernel at 5.6.7 due to loop regression breaking blackbox test#976
cgwalters merged 1 commit intocoreos:masterfrom
jlebon:pr/override-ci-kernel

jlebon commented May 7, 2020

Uh oh!

arithx left a comment

Uh oh!

jlebon commented May 7, 2020

Uh oh!

jlebon commented May 7, 2020

Uh oh!

jlebon commented May 7, 2020

Uh oh!

cgwalters commented May 8, 2020

Uh oh!

ashcrow commented May 8, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jlebon commented May 7, 2020

Uh oh!

arithx left a comment

Choose a reason for hiding this comment

Uh oh!

jlebon commented May 7, 2020

Uh oh!

jlebon commented May 7, 2020

Uh oh!

jlebon commented May 7, 2020

Uh oh!

cgwalters commented May 8, 2020

Uh oh!

ashcrow commented May 8, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants