cri-o failing to restart after upgrading from v2.26 to v2.27 #11907
Comments
From the cri-o GitHub repo... when switching container runtimes (e.g. from runc to crun), the process is as follows: we need to stop any running containers before starting cri-o with the new runtime (see the sketch below).
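A minimal sketch of that drain step, assuming crictl is installed on the node and cri-o is still running with the old runtime configured (illustrative Ansible tasks, not kubespray role code):

```yaml
# Sketch only: drain running pods before cri-o is restarted with a new
# default runtime. Assumes crictl is installed and cri-o is still up.
- name: List running pod sandboxes
  ansible.builtin.command: crictl pods -q
  register: crio_pods
  changed_when: false

- name: Stop each pod sandbox while the old runtime config is still active
  ansible.builtin.command: crictl stopp {{ item }}
  loop: "{{ crio_pods.stdout_lines }}"
```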
To make this even more fun: with Fedora CoreOS I end up getting the following (regardless of runc or crun):
Additionally, when going from Fedora CoreOS ...
@schoentoon There's an issue & PR open to address that Fedora CoreOS error.
I have also hit this issue; I believe it was not announced in the UPGRADE guide, but it should be. I needed to go back to kubespray 2.26, so be aware if you use CRI-O with Cilium.
What happened?
When upgrading a cluster from v2.26 -> v2.27, the container-engine/cri-o role hangs indefinitely waiting for cri-o to start on the first cluster node being upgraded.
What did you expect to happen?
cri-o should successfully upgrade
How can we reproduce it (as minimally and precisely as possible)?
When using cri-o as the container engine, upgrade a cluster from v2.26.0 -> v2.27.0. Upgrading the first node in the cluster should fail because cri-o does not upgrade.
OS
Linux 4.18.0-553.33.1.el8_10.x86_64 x86_64
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.10 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://issues.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.10
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"
Version of Ansible
(controller)
ansible [core 2.16.14]
config file = /custom/ansible.cfg
configured module search path = ['/custom/library', '/custom/kubespray/library']
ansible python module location = /opt/venv/lib64/python3.12/site-packages/ansible
ansible collection location = /opt/venv/ansible/collections
executable location = /opt/venv/bin/ansible
python version = 3.12.5 (main, Dec 3 2024, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] (/opt/venv/bin/python3.12)
jinja version = 3.1.5
libyaml = True
Version of Python
Python 3.12.5 (controller)
Version of Kubespray (commit)
9ec9b3a
Network plugin used
cilium
Full inventory with variables
We use a custom inventory plugin. Here are the kubespray cri-o variables set when creating/upgrading a cluster:
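(The variables themselves were not captured above. Purely as an illustration of the shape, not our actual values, a minimal cri-o inventory might contain:)

```yaml
# Illustrative sketch only: not the actual inventory values from our cluster.
container_manager: crio   # standard kubespray variable selecting cri-o as the engine
# ...the rest of our cri-o variables are omitted here...
```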
Command used to invoke ansible
ansible-playbook -i custom_plugin.yaml --become-method=sudo --become --become-user root upgrade-cluster.yml
Output of ansible run
The cri-o section of the ansible run logs (trying to upgrade the first node):
^^^ hangs here
journal logs for cri-o:
Anything else we need to know
Looks like the issue is caused by switching cri-o to use the crun container runtime in crio 1.31 / kubespray v2.27.0 (#11601).
Checking the journal logs for cri-o on the node that failed to restart, we can see that crio is failing to stop containers. These containers were started using the runc container runtime (and cri-o 1.30.x).
Here is /etc/crio/config.json before the upgrade (kubespray v2.26.0 / crio 1.30.x / runc) and after the upgrade (kubespray v2.27.0 / crio 1.31.x / crun):

The issue is occurring because the existing containers need to be stopped before /etc/crio/config.json is updated; crio will stop the containers if the runc config is still in use. In the container-engine/cri-o role, the first time crio gets restarted is here:
kubespray/roles/container-engine/cri-o/tasks/main.yaml
Line 230 in d2e51e7
This happens after the config files and the crio binaries are updated, so stopping crio will use the updated files.
One suggestion for improving the role:
During upgrades, crio should be stopped with the version & config that originally started the containers. It's better to always stop/start crio in this role: this will always stop any running containers, but it's a safer way to upgrade crio. A rough sketch of this ordering follows below.
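This sketch assumes systemd manages the crio unit and crictl is available on the node; task names and structure are illustrative, not the actual role code:

```yaml
# Rough sketch of the suggested ordering (not actual kubespray role code).
- name: Force-stop and remove all pods while the old runtime is still configured
  ansible.builtin.command: crictl rmp --force --all

- name: Stop cri-o before its binary and config are replaced
  ansible.builtin.systemd:
    name: crio
    state: stopped

# ... existing tasks that update the crio binaries and /etc/crio config go here ...

- name: Start cri-o with the new runtime config
  ansible.builtin.systemd:
    name: crio
    state: started
```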