
Storage option changes in CRI-O configuration requires a reboot to be taken into account #8322

Open
visheshtanksale opened this issue Jun 27, 2024 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@visheshtanksale

What happened?

Set up Kata using kata-deploy on CRI-O.
When creating a test pod I get the error below:

Jun 21 11:51:02 ipp1-1848 crio[613259]: time="2024-06-21T11:51:02.647812257Z" level=error msg="createContainer failed" error="rpc error: code = Internal desc = the file /bin/bash was not found" name=containerd-shim-v2 pid=614221 

If I try to bring up any other container using the kata-qemu runtime, I get a similar error saying that the command that is the container's entrypoint is not found.

Attached crio log here
Attached kata log here

QEMU and Kata versions are below:

[Hypervisor]
  MachineType = "q35"
  Version = "QEMU emulator version 7.2.0 (kata-static)\nCopyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers"
  Path = "/opt/kata/bin/qemu-system-x86_64"
  BlockDeviceDriver = "virtio-scsi"
  EntropySource = "/dev/urandom"
  SharedFS = "virtio-fs"
  VirtioFSDaemon = "/opt/kata/libexec/virtiofsd"
  SocketPath = ""
  Msize9p = 8192
  MemorySlots = 10
  HotPlugVFIO = "no-port"
  ColdPlugVFIO = "no-port"
  PCIeRootPort = 0
  PCIeSwitchPort = 0
  Debug = true
  [Hypervisor.SecurityInfo]
    Rootless = false
    DisableSeccomp = false
    GuestHookPath = ""
    EnableAnnotations = ["enable_iommu", "virtio_fs_extra_args", "kernel_params"]
    ConfidentialGuest = false

[Runtime]
  Path = "/opt/kata/bin/kata-runtime"
  GuestSeLinuxLabel = ""
  Debug = true
  Trace = false
  DisableGuestSeccomp = true
  DisableNewNetNs = false
  SandboxCgroupOnly = false
  [Runtime.Config]
    Path = "/opt/kata/share/defaults/kata-containers/configuration-qemu.toml"
  [Runtime.Version]
    OCI = "1.1.0+dev"
    [Runtime.Version.Version]
      Semver = "3.5.0"
      Commit = "cce735a09e7374ee52a3b4f5d4a4923e9af07f73"
      Major = 3
      Minor = 5
      Patch = 0
      

I opened an issue on kata-containers.
@littlejawa suggested adding the storage overlay config:

[crio]
  storage_option = [
	"overlay.skip_mount_home=true",
  ]

But this doesn't help.
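
For reference, a sketch of how such an option is typically applied (the drop-in file name below is illustrative; CRI-O also reads /etc/crio/crio.conf directly):

# Sketch: add the storage option via a config drop-in, then restart CRI-O
cat <<'EOF' > /etc/crio/crio.conf.d/99-storage-option.conf
[crio]
  storage_option = [
    "overlay.skip_mount_home=true",
  ]
EOF
systemctl restart crio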

What did you expect to happen?

The pod should come up without error

How can we reproduce it (as minimally and precisely as possible)?

  • Install kata-deploy on a host running CRI-O, following the details mentioned here
  • Create a pod with the kata-qemu runtime class (see the example manifest below)
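
A minimal pod manifest of the kind used to reproduce this (name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: kata-test
spec:
  runtimeClassName: kata-qemu
  containers:
  - name: test
    image: ubuntu:22.04
    command: ["sleep", "infinity"]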

Anything else we need to know?

No response

CRI-O and Kubernetes version

$ crio --version
crio version 1.31.0
Version:        1.31.0
GitCommit:      004b5dc40823f9bce9b34c6da2a769778725c0f5
GitCommitDate:  2024-06-18T16:24:04Z
GitTreeState:   clean
BuildDate:      1970-01-01T00:00:00Z
GoVersion:      go1.22.3
Compiler:       gc
Platform:       linux/amd64
Linkmode:       static
BuildTags:
  static
  netgo
  osusergo
  exclude_graphdriver_btrfs
  exclude_graphdriver_devicemapper
  seccomp
  apparmor
  selinux
  exclude_graphdriver_devicemapper
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  true
$ kubectl version --output=json
{
  "clientVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.11",
    "gitCommit": "f25b321b9ae42cb1bfaa00b3eec9a12566a15d91",
    "gitTreeState": "clean",
    "buildDate": "2024-06-11T20:20:18Z",
    "goVersion": "go1.21.11",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "kustomizeVersion": "v5.0.4-0.20230601165947-6ce0bf390ce3",
  "serverVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.11",
    "gitCommit": "f25b321b9ae42cb1bfaa00b3eec9a12566a15d91",
    "gitTreeState": "clean",
    "buildDate": "2024-06-11T20:11:29Z",
    "goVersion": "go1.21.11",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux ipp1-1848 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

@visheshtanksale visheshtanksale added the kind/bug Categorizes issue or PR as related to a bug. label Jun 27, 2024
@visheshtanksale
Author

cc: @zvonkok

@haircommander
Member

@littlejawa is this something you're helping with or are you looking for reinforcements?

@zvonkok

zvonkok commented Jun 27, 2024

@haircommander Yes, he is helping with that, and we're currently out of options and need reinforcements.

@haircommander
Member

what happens when you create the container with a different oci runtime?

@visheshtanksale
Author

what happens when you create the container with a different oci runtime?

Non-Kata containers are created successfully.

@littlejawa
Contributor

The symptom is similar to what we saw with kata 3.3.0, where the content of the container's rootfs was not accessible to the runtime.
We fixed it in our own CI by adding the flag "storage.overlay.skip_mount_home=true" in crio's config.
I'm also fixing it in the same way in the crio CI for kata, in #7958.

In this cluster the flag was not there, so we added it, but it didn't solve the problem.
Could crio ignore the flag for some reason? What else could cause the same symptom?

@fidencio
Contributor

fidencio commented Jul 3, 2024

After some experiments from my side, this is what I learned.

[crio]
storage_option = [
  "overlay.skip_mount_home=true",
]

If ^^^ is set before kubernetes is deployed, we're good.
If ^^^ is set after kubernetes is deployed, restarting CRI-O / kubelet does not solve the issue, although a full reboot does.

I've also added the same comment to the Kata Containers issue.
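
For anyone hitting this, a rough way to check whether the option actually took effect on a running node (assuming the default storage root under /var/lib/containers/storage, and that the crio status subcommand is available in your build):

# Show the storage options the running CRI-O is using
crio status config | grep -A 3 storage_option

# With overlay.skip_mount_home=true, the overlay home should no longer be
# bind-mounted onto itself, so this should report no mount
findmnt /var/lib/containers/storage/overlay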

@littlejawa
Contributor

Hey @haircommander,

I think we need your brain here :)

CRI-O was taking our change into account (according to its logs), but Kata still couldn't access the files from the container rootfs, meaning that the mount was still wrong.
We managed to make the cluster work by rebooting the node. Reloading / restarting CRI-O multiple times didn't help.

Is it because the layers were already mounted with the wrong flag, and not updated as part of the reload/restart?
If so, is there anything else we could have done to make them remounted properly?

Is rebooting the node the right way to get this setting applied?

@haircommander
Member

Is it because the layers were already mounted with the wrong flag, and not updated as part of the reload/restart?
If so, is there anything else we could have done to make them remounted properly?

Yeah, that makes sense to me. I think the only way to fix it would be to remove the containers and images. Rebooting is probably the least intrusive option.
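
For the record, a sketch of what "remove the containers and images" could look like without a full reboot (this assumes crictl is pointed at the CRI-O socket, and it will disrupt every workload on the node, so drain it first):

# Force-remove all pod sandboxes and all images so that layers are
# re-mounted with the new storage options on the next pull
crictl rmp -f -a
crictl rmi -a
systemctl restart crio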

@littlejawa
Contributor

I see two things here:

  1. This issue is not about Kata. I can't edit the title, but I think it should be something like: "Storage option changes in crio config requires a reboot to be taken into account"

  2. Do we want to fix it?
    Removing all images/containers is not something that I expect CRI-O to do by itself on every reload/restart.
    Even if we limit it to this specific kind of configuration change (assuming we can tell that it's a new setting), it can be very impactful.
    On the other hand, as one of the people who scratched their heads trying to understand what was going on, can we add a warning (maybe as comments in the conf file, see the sketch below) to make sure people are aware they may need to reboot if they change it?
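
Something along these lines in the shipped configuration (the wording is only a suggestion) might save the next person some head-scratching:

[crio]
  # NOTE: changing storage_option only affects layers mounted after the
  # change; layers of existing images/containers keep their old mount
  # options until they are removed or the node is rebooted.
  storage_option = [
  ]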

@kwilczynski
Member

/retitle Storage option changes in CRI-O configuration requires a reboot to be taken into account

@openshift-ci openshift-ci bot changed the title Pod creation fails with CRI-O on kata-qemu runtime Storage option changes in CRI-O configuration requires a reboot to be taken into account Jul 11, 2024
@kwilczynski
Member

[...]

  1. This issue is not about kata. I can't edit the title, but I think it should be something like : "Storage option changes in crio config requires a reboot to be taken into account"

@littlejawa, this is a restart of the guest virtual machine, correct? I hope that the host on which CRI-O runs does not require that.

@littlejawa
Contributor

No, we're talking about the host, unfortunately.

The problem is as follows:

  • CRI-O runs with some storage options
  • you change those options (here: asking to skip the private bind mount) and restart/reload CRI-O
    => the change is not taken into account (at least not for existing images/containers, if I understand correctly).

The way to make it take effect is to reboot the node.
That's bad, but the alternative seems to be: remove all images/containers... so maybe rebooting is the lesser of two evils :-(


A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 11, 2024
@kwilczynski kwilczynski removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 25, 2024

A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 25, 2024
@zvonkok

zvonkok commented Sep 26, 2024

/remove-lifecycle-stale

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2024