cri-o failing to restart after upgrading from v2.26 to v2.27 #11907
Comments
From the cri-o GitHub repo... when switching container runtimes (e.g. from runc to crun), the process is as follows: we need to stop any running containers before starting cri-o with the new runtime (see the sketch below).
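A minimal sketch of that drain step, assuming crictl is installed on the node and cri-o is still running with the old runtime configured (illustrative Ansible tasks, not kubespray role code):

```yaml
# Sketch only: drain running pods before cri-o is restarted with a new
# default runtime. Assumes crictl is installed and cri-o is still up.
- name: List running pod sandboxes
  ansible.builtin.command: crictl pods -q
  register: crio_pods
  changed_when: false

- name: Stop each pod sandbox while the old runtime config is still active
  ansible.builtin.command: crictl stopp {{ item }}
  loop: "{{ crio_pods.stdout_lines }}"
```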
To make this even more fun: with Fedora CoreOS I end up getting the following (regardless of runc or crun):
Additionally, when going from Fedora CoreOS ...
@schoentoon There's an issue & PR open to address that Fedora CoreOS error.
I have also hit this issue; I believe it was not announced in the UPGRADE guide, but it should be. I needed to go back to kubespray 2.26, so be aware if you use CRI-O with Cilium.
What happened?
When upgrading a cluster from v2.26 -> v2.27, the container-engine/cri-o role hangs indefinitely waiting for cri-o to start on the first cluster node being upgraded.
What did you expect to happen?
cri-o should successfully upgrade
How can we reproduce it (as minimally and precisely as possible)?
When using cri-o as the container engine, upgrade a cluster from v2.26.0 -> v2.27.0. Upgrading the first node in the cluster should fail because cri-o does not upgrade.
OS
Linux 4.18.0-553.33.1.el8_10.x86_64 x86_64
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.10 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://issues.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.10
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"
Version of Ansible
(controller)
ansible [core 2.16.14]
config file = /custom/ansible.cfg
configured module search path = ['/custom/library', '/custom/kubespray/library']
ansible python module location = /opt/venv/lib64/python3.12/site-packages/ansible
ansible collection location = /opt/venv/ansible/collections
executable location = /opt/venv/bin/ansible
python version = 3.12.5 (main, Dec 3 2024, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] (/opt/venv/bin/python3.12)
jinja version = 3.1.5
libyaml = True
Version of Python
Python 3.12.5 (controller)
Version of Kubespray (commit)
9ec9b3a
Network plugin used
cilium
Full inventory with variables
We use a custom inventory plugin. Here are the kubespray cri-o variables set when creating/upgrading a cluster:
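(The variables themselves were not captured above. Purely as an illustration of the shape, not our actual values, a minimal cri-o inventory might contain:)

```yaml
# Illustrative sketch only: not the actual inventory values from our cluster.
container_manager: crio   # standard kubespray variable selecting cri-o as the engine
# ...the rest of our cri-o variables are omitted here...
```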
Command used to invoke ansible
ansible-playbook -i custom_plugin.yaml --become-method=sudo --become --become-user root upgrade-cluster.yml
Output of ansible run
The cri-o section of the ansible run logs (trying to upgrade the first node):
^^^ hangs here
journal logs for cri-o:
Anything else we need to know
Looks like the issue is caused by switching cri-o to use the crun container runtime in crio 1.31 / kubespray v2.27.0 (#11601).
Checking the journal logs for cri-o on the node that failed to restart, we can see that crio is failing to stop containers. These containers were started using the runc container runtime (and cri-o 1.30.x).
Here is /etc/crio/config.json before the upgrade (kubespray v2.26.0 / crio 1.30.x / runc) and after the upgrade (kubespray v2.27.0 / crio 1.31.x / crun):

The issue is occurring because the existing containers need to be stopped before /etc/crio/config.json is updated; crio will stop the containers if the runc config is still in use. In the container-engine/cri-o role, the first time crio gets restarted is here:
kubespray/roles/container-engine/cri-o/tasks/main.yaml
Line 230 in d2e51e7
This happens after the config files and the crio binaries are updated, so stopping crio will use the updated files.
One suggestion for improving the role:
During upgrades, crio should be stopped with the version & config that originally started the containers. It's better to always stop/start crio in this role: this will always stop any running containers, but it's a safer way to upgrade crio. A rough sketch of this ordering follows below.
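This sketch assumes systemd manages the crio unit and crictl is available on the node; task names and structure are illustrative, not the actual role code:

```yaml
# Rough sketch of the suggested ordering (not actual kubespray role code).
- name: Force-stop and remove all pods while the old runtime is still configured
  ansible.builtin.command: crictl rmp --force --all

- name: Stop cri-o before its binary and config are replaced
  ansible.builtin.systemd:
    name: crio
    state: stopped

# ... existing tasks that update the crio binaries and /etc/crio config go here ...

- name: Start cri-o with the new runtime config
  ansible.builtin.systemd:
    name: crio
    state: started
```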