[release-4.17] OCPBUGS-38292: controller: default to runc when upgrading clusters from 4.17 to 4.18 #4715
Conversation
/retest
(force-pushed b3e8e63 to 6b44497)
(force-pushed ce4319d to efde22d)
@sohankunkerkar: This pull request references Jira Issue OCPBUGS-38292, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
(force-pushed efde22d to 7dedcc4)
```go
		continue
	}
	// Check if ContainerRuntimeConfig exists
	managedKeyForCtr, err := getManagedKeyCtrCfg(pool, ctrl.client, cfg)
```
getManagedKeyCtrCfg returns the MC name for the passed-in cfg object; it can be an existing MC or a new MC that does not exist yet. There may be multiple ContainerRuntimeConfig objects, and only the last one takes effect. I think that to check the current runtime configuration, we have to traverse the existing `99-<pool>-generated-containerruntimeconfig-x` machineconfigs in reverse order and take the first existing DefaultRuntime configuration we find (see the sketch below). @yuqi-zhang What do you think?
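A rough sketch of that reverse traversal in plain Go, operating only on MC names. The `hasDefaultRuntimeFile` callback and the name-parsing details are illustrative assumptions invented for this sketch, standing in for inspecting an MC's ignition payload for `/etc/crio/crio.conf.d/01-ctrcfg-defaultRuntime`; this is not the MCO's actual helper:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// suffixOf extracts the numeric suffix of a generated MC name;
// the unsuffixed base name counts as suffix 0.
func suffixOf(name, prefix string) (int, bool) {
	if name == prefix {
		return 0, true
	}
	rest, ok := strings.CutPrefix(name, prefix+"-")
	if !ok {
		return 0, false
	}
	n, err := strconv.Atoi(rest)
	return n, err == nil
}

// latestDefaultRuntimeMC walks the 99-<pool>-generated-containerruntimeconfig[-N]
// MachineConfigs in reverse suffix order and returns the first one that
// carries a DefaultRuntime drop-in.
func latestDefaultRuntimeMC(names []string, pool string, hasDefaultRuntimeFile func(string) bool) (string, bool) {
	prefix := fmt.Sprintf("99-%s-generated-containerruntimeconfig", pool)
	type cand struct {
		name   string
		suffix int
	}
	var cands []cand
	for _, name := range names {
		if n, ok := suffixOf(name, prefix); ok {
			cands = append(cands, cand{name, n})
		}
	}
	// Highest suffix first: the most recently generated config wins.
	sort.Slice(cands, func(i, j int) bool { return cands[i].suffix > cands[j].suffix })
	for _, c := range cands {
		if hasDefaultRuntimeFile(c.name) {
			return c.name, true
		}
	}
	return "", false
}

func main() {
	names := []string{
		"99-worker-generated-containerruntimeconfig",
		"99-worker-generated-containerruntimeconfig-2",
	}
	// Pretend only the unsuffixed MC carries the defaultRuntime drop-in,
	// as in the two-config scenario discussed in this thread.
	has := func(name string) bool { return name == "99-worker-generated-containerruntimeconfig" }
	fmt.Println(latestDefaultRuntimeMC(names, "worker", has))
}
```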
Hmm, I guess this is a scenario where the design of the CRCC is a bit weird. Thinking through this scenario:
- the user provides their own runtime pin, which generates into `99-pool-generated-containerruntimeconfig` and has contents for `/etc/crio/crio.conf.d/01-ctrcfg-defaultRuntime`
- the user then adds a second config that sets e.g. pidsLimit, which then translates to `/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit` as `99-pool-generated-containerruntimeconfig-2`

Technically, the way we frame kubelet/containerruntimeconfigs, we don't merge them, so we now have 2 machineconfigs, but both containerruntimeconfigs still exist and are taking effect since they define different files. If I were to ever unpin by deleting the runtime default, I guess it still works, since `99-pool-generated-containerruntimeconfig` should get deleted alongside it?

Then in that case I guess we do have to parse through all the existing configuration... either by parsing all MCs that exist or by parsing all containerruntimeconfigs that exist (assuming that's the only way you can set a default runtime); a sketch of the latter follows below.
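The second alternative, walking the ContainerRuntimeConfig objects themselves rather than the rendered MCs, might look like this minimal sketch. It assumes the mcfgv1 API exposes the DefaultRuntime field discussed in this thread, and it uses creation time as the "last one wins" order, which is an assumption rather than the controller's actual tie-break:

```go
package ctrcfg

import (
	"sort"

	mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
)

// effectiveDefaultRuntime lists every ContainerRuntimeConfig and returns the
// DefaultRuntime of the newest one that sets it, mirroring the
// "only the last one takes effect" behavior described above.
// Note: sorts the caller's slice in place; fine for a sketch.
func effectiveDefaultRuntime(cfgs []*mcfgv1.ContainerRuntimeConfig) (string, bool) {
	// Oldest first, so a later config overrides an earlier one.
	sort.Slice(cfgs, func(i, j int) bool {
		return cfgs[i].CreationTimestamp.Before(&cfgs[j].CreationTimestamp)
	})
	runtime, found := "", false
	for _, c := range cfgs {
		if c.Spec.ContainerRuntimeConfig != nil && string(c.Spec.ContainerRuntimeConfig.DefaultRuntime) != "" {
			runtime, found = string(c.Spec.ContainerRuntimeConfig.DefaultRuntime), true
		}
	}
	return runtime, found
}
```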
> If I were to ever unpin via delete the runtime default, I guess it still works since `99-pool-generated-containerruntimeconfig` should get deleted alongside it?

Yes, `99-pool-generated-containerruntimeconfig` will be deleted automatically when the corresponding ContainerRuntimeConfig objects are deleted.
```go
	}

	// create the MC for the drop in default-container-runtime crio.conf file
	if err := ctrl.createDefaultContainerRuntimeMC(cfg); err != nil {
```
We would not get a cfg object if the key passed to syncContainerRuntimeConfig is forceSyncOnUpgrade; syncContainerRuntimeConfig would return before this line because the ContainerRuntimeConfig does not exist.
Oh, so the best way here is to query an API and get the cfg?
I just think that if we have to fetch the machineconfigs to check whether an existing runtime configuration has been set on the cluster, we don't need the cfg argument for createDefaultContainerRuntimeMC at all (for example, something like the sketch below).
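If the drop-in is a constant, a cfg-free variant could look like this minimal sketch. The struct types and MC name are simplified stand-ins invented for illustration (not the MCO's real mcfgv1/ignition types); the drop-in path comes from this thread, and `default_runtime` under `[crio.runtime]` is standard crio.conf syntax:

```go
package ctrcfg

import "fmt"

// fileSpec is a stand-in for the ignition file entry an MC would carry.
type fileSpec struct {
	Path     string
	Contents string
}

// machineConfig is a stand-in for mcfgv1.MachineConfig, reduced to what
// this sketch needs.
type machineConfig struct {
	Name  string
	Files []fileSpec
}

// defaultRuntimeDropIn pins runc as the default runtime; the file contents
// here are an assumption about what the controller would write.
const defaultRuntimeDropIn = `[crio.runtime]
default_runtime = "runc"
`

// newDefaultContainerRuntimeMC builds the MC that pins runc via a crio.conf.d
// drop-in. Note it needs no ContainerRuntimeConfig (cfg) argument: everything
// it writes is constant. The MC name is hypothetical.
func newDefaultContainerRuntimeMC(pool string) machineConfig {
	return machineConfig{
		Name: fmt.Sprintf("99-%s-generated-crio-default-runtime", pool),
		Files: []fileSpec{{
			Path:     "/etc/crio/crio.conf.d/01-ctrcfg-defaultRuntime",
			Contents: defaultRuntimeDropIn,
		}},
	}
}
```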
Could you clarify the purpose of this PR? My understanding is:
I think the idea here is to capture the following scenarios:
And for other cases where the cluster version is < 4.17, I have already confirmed with the OTA team that all updates will go through this change, regardless of the cluster's previous state.
(force-pushed 7458888 to aa124e4)
/retest
(force-pushed 24abfa3 to 736bd12)
/test unit
QiWang19 left a comment:
The code should work fine, but we will need to perform some manual upgrade testing to ensure everything functions correctly.
pkg/controller/container-runtime-config/container_runtime_config_controller.go (outdated review thread, resolved)
(force-pushed 736bd12 to 84fbe70)
/test e2e-gcp-op
/retest
/retest
/retest
/retest
@sohankunkerkar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Merged d7c30c8 into openshift:release-4.17
@sohankunkerkar: Jira Issue OCPBUGS-38292: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-38292 has been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[ART PR BUILD NOTIFIER] Distgit: ose-machine-config-operator
In CRI-O 1.33, a change (cri-o/cri-o#8962) was made to the default limits set for CRI-O. Now the ulimit nofile is set much lower, with space to set it higher. However, some workloads don't expect this change and fail (see https://issues.redhat.com/browse/OCPBUGS-62095). This was worked around temporarily in openshift#5308, but that workaround was not intended to be carried into 4.21. Instead, we should drop in an ignition file on upgrades from 4.20 to 4.21 to make sure existing clusters don't get this change, but new clusters started in 4.21 do. This was entirely based on openshift#4715. Signed-off-by: Peter Hunt <pehunt@redhat.com>
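The gating logic that commit describes, applying the compatibility drop-in only to clusters upgrading from 4.20 or earlier and never to fresh 4.21 installs, reduces to a version comparison. This sketch assumes a plain "major.minor[.patch]" version string and is not the actual implementation:

```go
package upgradegate

import (
	"fmt"
	"strconv"
	"strings"
)

// needsCompatDropIn reports whether a cluster's pre-upgrade version predates
// 4.21, i.e. it existed before the CRI-O ulimit change and should keep the
// old nofile behavior via the compatibility drop-in.
func needsCompatDropIn(prevVersion string) (bool, error) {
	parts := strings.SplitN(prevVersion, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unparseable version %q", prevVersion)
	}
	major, errMaj := strconv.Atoi(parts[0])
	minor, errMin := strconv.Atoi(parts[1])
	if errMaj != nil || errMin != nil {
		return false, fmt.Errorf("unparseable version %q", prevVersion)
	}
	// Upgrades from 4.20 (or anything older) get the drop-in;
	// clusters born in 4.21+ keep the new CRI-O defaults.
	return major < 4 || (major == 4 && minor < 21), nil
}
```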