Conversation

@sohankunkerkar
Member

@sohankunkerkar sohankunkerkar commented Nov 22, 2024

We need to handle the following cases:

  • The user has already set the default runtime and then updates to 4.17.z with this change.
  • The user updates to 4.17.z with this change and then tries setting the default runtime of their choice.
  • The user updates from 4.17.z with the defaulting logic to 4.18 and then wants to set the runtime.

@sohankunkerkar
Member Author

/retest

@openshift-ci openshift-ci bot requested review from mtrmac and wgahnagl November 22, 2024 20:36
@openshift-merge-robot openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Nov 23, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Nov 25, 2024
@sohankunkerkar sohankunkerkar force-pushed the up-release-4.17 branch 2 times, most recently from ce4319d to efde22d on November 25, 2024 14:19
@sohankunkerkar sohankunkerkar changed the title crio: skip MC creation if the containerruntimeconfig already exists OCPBUGS-38292: controller: default to runc when upgrading clusters from 4.17 to 4.18 Nov 25, 2024
@openshift-ci-robot openshift-ci-robot added the jira/severity-critical, jira/valid-reference, and jira/invalid-bug labels Nov 25, 2024
@openshift-ci-robot
Contributor

@sohankunkerkar: This pull request references Jira Issue OCPBUGS-38292, which is invalid:

  • expected Jira Issue OCPBUGS-38292 to depend on a bug targeting a version in 4.18.0 and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

We need to handle the case where users are upgrading from 4.16 to 4.17: if the containerruntimeconfig already exists, then MCO will not create an MC to set the default runtime.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

continue
}
// Check if ContainerRuntimeConfig exists
managedKeyForCtr, err := getManagedKeyCtrCfg(pool, ctrl.client, cfg)
Member

getManagedKeyCtrCfg returns the MC name for the passed-in cfg object. That can be an existing MC or a new MC that does not exist yet.
There may be multiple containerruntimeconfig objects, and only the last one takes effect.
I think that to check the current runtime configuration, we have to traverse the existing 99-pool-generated-containerruntimeconfig-x machineconfigs in reverse order and use the first existing DefaultRuntime configuration we find. @yuqi-zhang What do you think?
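
A minimal sketch of that reverse-order lookup, using stand-in types rather than the controller's real API (the helper shape is an assumption; only the 99-<pool>-generated-containerruntimeconfig naming and the 01-ctrcfg-defaultRuntime drop-in path come from this thread):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// machineConfig is a minimal stand-in for mcfgv1.MachineConfig: just a name
// plus the file paths its Ignition payload renders.
type machineConfig struct {
	Name  string
	Files []string
}

// findExistingDefaultRuntimeMC walks the generated
// 99-<pool>-generated-containerruntimeconfig[-N] MachineConfigs from the
// newest suffix to the oldest and returns the first one carrying the
// default-runtime drop-in, or nil if no existing config pins a runtime.
func findExistingDefaultRuntimeMC(mcs []machineConfig, pool string) *machineConfig {
	prefix := fmt.Sprintf("99-%s-generated-containerruntimeconfig", pool)
	var generated []machineConfig
	for _, mc := range mcs {
		if strings.HasPrefix(mc.Name, prefix) {
			generated = append(generated, mc)
		}
	}
	// Reverse lexical order puts "-2" ahead of the unsuffixed name; a real
	// implementation would parse the numeric suffix so "-10" sorts after "-9".
	sort.Slice(generated, func(i, j int) bool { return generated[i].Name > generated[j].Name })
	for i := range generated {
		for _, f := range generated[i].Files {
			if f == "/etc/crio/crio.conf.d/01-ctrcfg-defaultRuntime" {
				return &generated[i]
			}
		}
	}
	return nil
}

func main() {
	mcs := []machineConfig{
		{Name: "99-worker-generated-containerruntimeconfig", Files: []string{"/etc/crio/crio.conf.d/01-ctrcfg-defaultRuntime"}},
		{Name: "99-worker-generated-containerruntimeconfig-2", Files: []string{"/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"}},
	}
	if mc := findExistingDefaultRuntimeMC(mcs, "worker"); mc != nil {
		fmt.Println("default runtime already pinned by", mc.Name)
	}
}
```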

Contributor

Hmm, I guess this is a scenario where the design of the CRCC is a bit weird. Thinking through this scenario:

  1. the user provides their own runtime pin, which generates into 99-pool-generated-containerruntimeconfig and has contents for /etc/crio/crio.conf.d/01-ctrcfg-defaultRuntime
  2. the user then adds a second config that sets e.g. pidsLimit, which then translates to /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit as 99-pool-generated-containerruntimeconfig-2

Technically, the way we frame kubelet/containerruntimeconfigs, we don't merge them, so we now have 2 machineconfigs, but both containerruntimeconfigs still exist and are taking effect since they define different files. If I were to ever unpin by deleting the runtime default, I guess it still works, since 99-pool-generated-containerruntimeconfig should get deleted alongside it?

Then in that case I guess we do have to parse through all the existing configuration... either through parsing all MCs that exist or all containerruntimeconfigs that exist (assuming that's the only way you can set a default runtime)
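
If the controller instead inspects the ContainerRuntimeConfig objects themselves, the check could look roughly like the fragment below; the DefaultRuntime field name only loosely mirrors the CRD and is an assumption:

```go
// containerRuntimeConfig is a stand-in for the ContainerRuntimeConfig CR;
// the real spec lives in the mcfgv1 API package.
type containerRuntimeConfig struct {
	Name           string
	DefaultRuntime string // empty when the CR does not pin a runtime
}

// defaultRuntimeAlreadySet reports whether any existing CR pins a default
// runtime, in which case the upgrade path should skip creating its own MC.
func defaultRuntimeAlreadySet(crcs []containerRuntimeConfig) (string, bool) {
	for _, crc := range crcs {
		if crc.DefaultRuntime != "" {
			return crc.DefaultRuntime, true
		}
	}
	return "", false
}
```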

Member

If I were to ever unpin by deleting the runtime default, I guess it still works, since 99-pool-generated-containerruntimeconfig should get deleted alongside it?

Yes, 99-pool-generated-containerruntimeconfig will be deleted automatically when the corresponding ContainerRuntimeConfig objects are deleted.

}

// create the MC for the drop in default-container-runtime crio.conf file
if err := ctrl.createDefaultContainerRuntimeMC(cfg); err != nil {
Member

We would not get a cfg object if the key passed to syncContainerRuntimeConfig is forceSyncOnUpgrade; syncContainerRuntimeConfig would return before this line because the ContainerRuntimeConfig does not exist.

Member Author

Oh, so the best way here is to query an API and get the cfg?

Member

I just think that if we have to fetch the machineconfigs to check whether an existing runtime configuration has been set on the cluster, we don't need the cfg argument for createDefaultContainerRuntimeMC.
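
Fetching the machineconfigs would make the creation path stateless, since the drop-in it writes never varies. A rough sketch with stand-in types (the generated MC name is hypothetical; the drop-in path and the runc default come from this thread):

```go
package main

import "fmt"

// ignFile and machineConfig are minimal stand-ins for the Ignition file entry
// and mcfgv1.MachineConfig types the controller really uses.
type ignFile struct {
	Path     string
	Contents string
}

type machineConfig struct {
	Name  string
	Files []ignFile
}

// The payload never varies: on upgrade the controller always pins the
// pre-4.18 default, runc. That is why no cfg argument is needed.
const defaultRuntimeDropIn = `[crio.runtime]
default_runtime = "runc"
`

// newDefaultContainerRuntimeMC builds the MC that keeps upgraded clusters on
// runc. The "99-<pool>-generated-crio-default-runtime" name is illustrative.
func newDefaultContainerRuntimeMC(pool string) machineConfig {
	return machineConfig{
		Name: fmt.Sprintf("99-%s-generated-crio-default-runtime", pool),
		Files: []ignFile{{
			Path:     "/etc/crio/crio.conf.d/01-ctrcfg-defaultRuntime",
			Contents: defaultRuntimeDropIn,
		}},
	}
}

func main() {
	mc := newDefaultContainerRuntimeMC("worker")
	fmt.Printf("%s -> %s\n%s", mc.Name, mc.Files[0].Path, mc.Files[0].Contents)
}
```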

@QiWang19
Member

Could you clarify the purpose of this PR? My understanding is:

  • If the cluster sets a default runtime to crun, we don't create a MachineConfig to switch to runc.
  • If the cluster does not set a runtime configuration, we default to runc.
  • If the cluster explicitly sets the runtime to runc, we leave the default as runc without making changes.

@sohankunkerkar
Member Author

Could you clarify the purpose of this PR? My understanding is:

  • If the cluster sets a default runtime to crun, we don't create a MachineConfig to switch to runc.
  • If the cluster does not set a runtime configuration, we default to runc.
  • If the cluster explicitly sets the runtime to runc, we leave the default as runc without making changes.

I think the idea here is to capture the following scenarios:

  • If the user has already set the default_runtime to crun via the container runtime config (not possible via an MC because we are making crun the default in 4.18) and then updates to 4.17.z with this change, the cluster should retain the default_runtime set by the user.
  • If the user updates to 4.17.z with this change and then tries setting the default_runtime to crun, it should work.
  • If the user updates from 4.17.z with the defaulting logic to 4.18 and then wants to set the default_runtime to runc, they will need to delete the MC manually, as we will not be handling the auto-deletion logic in MCO.

And for other cases where the cluster version is < 4.17, I have already confirmed with the OTA team that all updates will go through this change, regardless of the cluster's previous state.
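
Condensed into one decision, the upgrade-time logic for those scenarios amounts to roughly the sketch below (a sketch under assumed helper names and shapes, not the PR's actual code):

```go
// ensureDefaultRuntimeOnUpgrade sketches the decision the scenarios above
// describe. userPinnedRuntime is whatever an existing ContainerRuntimeConfig
// sets; pinningMCExists is whether the generated runc-pinning MC is present.
func ensureDefaultRuntimeOnUpgrade(userPinnedRuntime string, pinningMCExists bool) string {
	switch {
	case userPinnedRuntime != "":
		// Scenario 1: the user already chose a runtime; keep their choice.
		return "skip"
	case pinningMCExists:
		// Scenario 3: the MC from an earlier 4.17.z upgrade is still there;
		// reverting to the new default means deleting that MC manually.
		return "skip"
	default:
		// Otherwise pin runc so the upgrade does not silently flip the
		// cluster to the new 4.18 default; setting a runtime afterwards
		// (scenario 2) is expected to take precedence over this MC.
		return "create runc-pinning MC"
	}
}
```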

@sohankunkerkar sohankunkerkar force-pushed the up-release-4.17 branch 3 times, most recently from 7458888 to aa124e4 on November 26, 2024 22:33
@sohankunkerkar
Member Author

/retest

@sohankunkerkar sohankunkerkar force-pushed the up-release-4.17 branch 2 times, most recently from 24abfa3 to 736bd12 on November 27, 2024 13:48
@sohankunkerkar
Member Author

/test unit

Member

@QiWang19 QiWang19 left a comment

The code should work fine, but we will need to perform some manual upgrade testing to ensure everything functions correctly.

@sohankunkerkar
Member Author

/test e2e-gcp-op

@openshift-ci openshift-ci bot changed the title [release-4.17] NO-JIRA: OCPBUGS-38292: controller: default to runc when upgrading clusters from 4.17 to 4.18 [release-4.17] OCPBUGS-38292: controller: default to runc when upgrading clusters from 4.17 to 4.18 Dec 16, 2024
@openshift-ci-robot openshift-ci-robot added the jira/severity-critical and jira/invalid-bug labels Dec 16, 2024
@openshift-ci-robot
Contributor

@sohankunkerkar: This pull request references Jira Issue OCPBUGS-38292, which is invalid:

  • expected Jira Issue OCPBUGS-38292 to depend on a bug targeting a version in 4.18.0 and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

We need to handle the following cases:

  • The user has already set the default runtime and then updates to 4.17.z with this change.
  • The user updates to 4.17.z with this change and then tries setting the default runtime of their choice.
  • The user updates from 4.17.z with the defaulting logic to 4.18 and then wants to set the runtime.

@mrunalp mrunalp added the jira/valid-bug label and removed the jira/invalid-bug label Dec 16, 2024
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 119a374 and 2 for PR HEAD 1fb734a in total

2 similar comments
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 119a374 and 2 for PR HEAD 1fb734a in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 119a374 and 2 for PR HEAD 1fb734a in total

@sohankunkerkar
Member Author

/retest

3 similar comments
@sohankunkerkar
Member Author

/retest

@sohankunkerkar
Member Author

/retest

@sohankunkerkar
Member Author

/retest

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 119a374 and 2 for PR HEAD 1fb734a in total

@openshift-ci
Contributor

openshift-ci bot commented Dec 18, 2024

@sohankunkerkar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/e2e-gcp-op-techpreview (commit 1fb734a, not required): /test e2e-gcp-op-techpreview
  • ci/prow/e2e-azure-ovn-upgrade-out-of-change (commit 1fb734a, not required): /test e2e-azure-ovn-upgrade-out-of-change


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit d7c30c8 into openshift:release-4.17 Dec 18, 2024
15 of 17 checks passed
@openshift-ci-robot
Contributor

@sohankunkerkar: Jira Issue OCPBUGS-38292: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-38292 has been moved to the MODIFIED state.

In response to this:

We need to handle the following cases:

  • The user has already set the default runtime and then updates to 4.17.z with this change.
  • The user updates to 4.17.z with this change and then tries setting the default runtime of their choice.
  • The user updates from 4.17.z with the defaulting logic to 4.18 and then wants to set the runtime.

@sohankunkerkar sohankunkerkar deleted the up-release-4.17 branch December 19, 2024 00:35
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.17.0-202412182334.p0.gd7c30c8.assembly.stream.el9.
All builds following this will include this PR.

haircommander added a commit to haircommander/machine-config-operator that referenced this pull request Dec 23, 2025
In cri-o 1.33, a change (cri-o/cri-o#8962) was made to the default limits set
for CRI-O. Now the ulimit nofile is set much lower, with room to set it higher. However, some workloads
don't expect this change and fail (see https://issues.redhat.com/browse/OCPBUGS-62095).

This was worked around temporarily in openshift#5308,
but that workaround was not intended to be carried into 4.21.

Instead, we should drop in an ignition file on upgrades from 4.20 to 4.21 to make sure existing clusters
don't get this change, but new clusters started in 4.21 do.

This was entirely based on openshift#4715

Signed-off-by: Peter Hunt <pehunt@redhat.com>
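
For context on the approach that commit describes, a minimal sketch of the kind of upgrade-only drop-in it means; default_ulimits is a real crio.conf option under [crio.runtime], but the nofile bounds here are illustrative placeholders, not the actual pre-1.33 defaults:

```go
// Sketch only: content of a drop-in that would keep upgraded 4.20 clusters on
// their previous file-descriptor limit after the CRI-O 1.33 change. The
// "nofile" soft:hard values below are placeholders.
const nofileDropIn = `[crio.runtime]
default_ulimits = ["nofile=1048576:1048576"]
`
```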

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
  • cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
