@wking
Member

@wking wking commented Mar 26, 2025

Closes: OCPBUGS-53427

- What I did

The kubelet skew guards are from 1471d2c (#2658). But the Kube API server also landed similar guards in openshift/cluster-kube-apiserver-operator@9ce4f74775 (openshift/cluster-kube-apiserver-operator#1199). openshift/enhancements@0ba744e750 (openshift/enhancements#762) had shifted the proposal from MCO-guards to KAS-guards, so I'm not entirely clear on why the MCO guards landed at all. But it's convenient for me that they did, because while I'm dropping them here, I'm recycling the Node lister for a new check.

4.19 is dropping bare-RHEL support, and I want the Node lister to look for RHEL entries like:

osImage: Red Hat Enterprise Linux 8.6 (Ootpa)

but we are ok with RHCOS entries like:

osImage: Red Hat Enterprise Linux CoreOS 419.96.202503032242-0

- How to verify it

Install a 4.18 cluster with this fix. Its machine-config ClusterOperator should be Upgradeable=True. Install a bare-RHEL node. The ClusterOperator should become Upgradeable=False and complain about that node. Remove the bare-RHEL node or somehow convert it to RHCOS. The ClusterOperator should become Upgradeable=True again.
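
The expected transitions above can be sketched as a small function from "which bare-RHEL Nodes exist" to an Upgradeable status. This is only an illustration under stated assumptions: the function name `upgradeable` and the message wording are hypothetical and are not taken from the operator's code.

```go
package main

import (
	"fmt"
	"strings"
)

// upgradeable is a hypothetical helper: it maps the names of bare-RHEL
// Nodes found in the cluster to an Upgradeable status and a message.
// The message text is illustrative, not the operator's actual wording.
func upgradeable(rhelNodes []string) (status, message string) {
	if len(rhelNodes) == 0 {
		// No bare-RHEL Nodes: nothing blocks the update to 4.19.
		return "True", ""
	}
	return "False", fmt.Sprintf(
		"%d bare-RHEL node(s) will not be compatible with OpenShift 4.19: %s",
		len(rhelNodes), strings.Join(rhelNodes, ", "))
}

func main() {
	// Fresh 4.18 cluster, RHCOS only.
	fmt.Println(upgradeable(nil))
	// After adding a bare-RHEL node.
	fmt.Println(upgradeable([]string{"worker-rhel-0"}))
	// After removing it again.
	fmt.Println(upgradeable([]string{}))
}
```

The three calls in `main` mirror the three verification steps: True on a fresh cluster, False once a bare-RHEL node joins, and True again once it is gone.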

- Description for the changelog

The machine-config operator now detects bare-RHEL Nodes and warns that they will not be compatible with OpenShift 4.19.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 26, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 26, 2025
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-53427, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-53427 to depend on a bug targeting a version in 4.19.0 and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@wking wking force-pushed the only-rhcos-on-4.19 branch 2 times, most recently from 68256ae to 5fc0354 Compare March 27, 2025 15:46
@wking wking force-pushed the only-rhcos-on-4.19 branch from 5fc0354 to 9915680 Compare April 3, 2025 21:10
@wking wking force-pushed the only-rhcos-on-4.19 branch 3 times, most recently from aabb1bf to 56ba6cb Compare April 15, 2025 22:57
The kubelet skew guards are from 1471d2c (Bug 1986453: Check for
API server and node versions skew, 2021-07-27, openshift#2658).  But the Kube
API server also landed similar guards in
openshift/cluster-kube-apiserver-operator@9ce4f74775 (add
KubeletVersionSkewController, 2021-08-26,
openshift/cluster-kube-apiserver-operator#1199).
openshift/enhancements@0ba744e750 (eus-upgrades-mvp: don't enforce
skew check in MCO, 2021-04-29, openshift/enhancements#762) had shifted
the proposal from MCO-guards to KAS-guards, so I'm not entirely clear
on why the MCO guards landed at all.  But it's convenient for me that
they did, because while I'm dropping them here, I'm recycling the Node
lister for a new check.

4.19 is dropping bare, package-managed RHEL support.  I'd initially
thought about looking for RHEL entries like:

  osImage: Red Hat Enterprise Linux 8.6 (Ootpa)

while excluding RHCOS entries like:

  osImage: Red Hat Enterprise Linux CoreOS 419.96.202503032242-0

But instead of switching on osImage, I'm using the
node.openshift.io/os_id label to find package-managed RHEL Nodes.  The
machine-config operator is setting up the label [1] based on the ID
value in /etc/os-release.  On RHCOS instances, the ID value is 'rhcos'
[2].  On package-managed RHEL, it's 'rhel' [3,4].

[1]: https://github.com/openshift/machine-config-operator/blob/ddc18e84f4a0650e0e87aa0a4f90f9cf01b5259c/templates/worker/01-worker-kubelet/_base/units/kubelet.service.yaml#L19-L31
[2]: https://github.com/openshift/os/blob/41f6a028d37b750db0bf4257447d809bd9cbe4bf/manifest-ocp-rhel-9.6.yaml#L41
[3]: https://github.com/openshift/enhancements/blob/ea465e192bfb58ec8654f1c904a4af68777f68ec/enhancements/rhcos/split-rhcos-into-layers.md?plain=1#L416
[4]: https://github.com/openshift/machine-config-operator/blob/ddc18e84f4a0650e0e87aa0a4f90f9cf01b5259c/pkg/daemon/osrelease/osrelease.go#L69
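
A minimal sketch of the label-based check described in the commit message above, assuming a pared-down `Node` type in place of the real Kubernetes Node object so the example stays self-contained. The label key `node.openshift.io/os_id` and the values `rhel` and `rhcos` come from the commit message; the type, function, and node names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical stand-in for the Kubernetes Node object,
// carrying only the fields this sketch needs.
type Node struct {
	Name   string
	Labels map[string]string
}

// osIDLabel is the label the commit message says is derived from the
// ID value in /etc/os-release.
const osIDLabel = "node.openshift.io/os_id"

// packageManagedRHELNodes returns the sorted names of Nodes whose
// os_id label is "rhel". Nodes labeled "rhcos", and Nodes without the
// label at all, are not reported.
func packageManagedRHELNodes(nodes []Node) []string {
	var names []string
	for _, n := range nodes {
		// Reading from a nil map is safe in Go and yields "".
		if n.Labels[osIDLabel] == "rhel" {
			names = append(names, n.Name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	nodes := []Node{
		{Name: "master-0", Labels: map[string]string{osIDLabel: "rhcos"}},
		{Name: "worker-rhel-0", Labels: map[string]string{osIDLabel: "rhel"}},
		{Name: "worker-unlabeled"},
	}
	fmt.Println(packageManagedRHELNodes(nodes))
}
```

Switching on the label rather than on osImage avoids having to tell "Red Hat Enterprise Linux 8.6 (Ootpa)" apart from "Red Hat Enterprise Linux CoreOS ..." by string matching. How the real check treats Nodes that lack the label is not stated in this thread; the sketch simply ignores them.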
@wking wking force-pushed the only-rhcos-on-4.19 branch from 56ba6cb to 13cceb0 Compare April 29, 2025 22:13
@wking wking changed the title WIP: OCPBUGS-53427: pkg/operator/status: Drop kubelet skew guard, add RHEL guard OCPBUGS-53427: pkg/operator/status: Drop kubelet skew guard, add RHEL guard Apr 29, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2025
@wking
Member Author

wking commented Apr 29, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 29, 2025
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-53427, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.z) matches configured target version for branch (4.18.z)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-54611 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-54611 targets the "4.19.0" version, which is one of the valid target versions: 4.19.0
  • bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira (gpei+old@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci
Contributor

openshift-ci bot commented Apr 30, 2025

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-op-techpreview | 13cceb0 | link | false | /test e2e-gcp-op-techpreview |
| ci/prow/e2e-vsphere-ovn-upi-zones | 13cceb0 | link | false | /test e2e-vsphere-ovn-upi-zones |
| ci/prow/e2e-vsphere-ovn-upi | 13cceb0 | link | false | /test e2e-vsphere-ovn-upi |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 13cceb0 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/e2e-vsphere-ovn-zones | 13cceb0 | link | false | /test e2e-vsphere-ovn-zones |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@djoshy
Contributor

djoshy commented Apr 30, 2025

Generally seems sane to me, just one question. We seem to be setting the Upgradeable condition to Unknown when we run into errors in the node check process. How does the CVO react to this? Does it still block upgrades?

/approve

Will add final backport tags after QE has done pre-merge testing.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2025
@wking
Member Author

wking commented Apr 30, 2025

We seem to be setting the Upgradeable condition to Unknown when we run into errors in the node check process. How does the CVO react to this? Does it still block upgrades?

API docs:

The cluster-version operator will allow updates when this condition is not False, including when it is missing, True, or Unknown.

so we're failing-open. And maybe not alerting; I could see us growing alerting for any Upgradeable=Unknown... But that's an independent handling decision at the CVO level, not something I think the MCO needs to worry about or try to work around.
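
The fail-open behavior in the quoted API docs can be written as a one-line predicate. This is a sketch of that single sentence only, not the cluster-version operator's actual code; the type and function names are hypothetical, and the empty string standing for a missing condition is an assumption of this example.

```go
package main

import "fmt"

// ConditionStatus mirrors the three status strings a condition can
// carry. ConditionMissing is a hypothetical marker used here for
// "the Upgradeable condition is not set at all".
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
	ConditionMissing ConditionStatus = ""
)

// allowsUpdates encodes the quoted rule: updates are allowed whenever
// the condition is not False, including missing, True, or Unknown.
func allowsUpdates(upgradeable ConditionStatus) bool {
	return upgradeable != ConditionFalse
}

func main() {
	for _, s := range []ConditionStatus{
		ConditionMissing, ConditionTrue, ConditionUnknown, ConditionFalse,
	} {
		fmt.Printf("Upgradeable=%q allows updates: %t\n", s, allowsUpdates(s))
	}
}
```

Under this rule, an error in the node check that yields Upgradeable=Unknown does not block the update; only an explicit False does.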

@ptalgulk01

Pre-merge verification steps:

Verified on an IPI-based AWS 4.18 cluster, installed with the template below:

private-templates/functionality-testing/aos-4_18/ipi-on-aws/versioned-installer-customer_vpc
create_int_svc_instance: "yes"

The RHEL node was added using the Jenkins job.

Detailed steps are defined here: https://issues.redhat.com/browse/OCPBUGS-53427?focusedId=27088479&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-27088479

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label May 6, 2025
@djoshy
Contributor

djoshy commented May 6, 2025

/lgtm

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label May 6, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 6, 2025
@openshift-ci
Contributor

openshift-ci bot commented May 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 57836ff into openshift:release-4.18 May 6, 2025
14 of 19 checks passed
@openshift-ci-robot
Contributor

@wking: Jira Issue OCPBUGS-53427: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-53427 has been moved to the MODIFIED state.


@wking wking deleted the only-rhcos-on-4.19 branch May 6, 2025 16:57
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.18.0-202505061704.p0.g57836ff.assembly.stream.el9.
All builds following this will include this PR.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.18.0-0.nightly-2025-05-06-231850

wking added a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request May 20, 2025
In 4.19:

* 377a78b (pkg/operator/status: Drop PoolUpdating as an Upgradeable=False condition, 2024-12-16, openshift#4760).
* 0c21907 (pkg/operator/status: Drop kubelet skew guard, 2025-04-03, openshift#4970).

But in 4.18, we're using the other order:

* 13cceb0 (pkg/operator/status: Drop kubelet skew guard, add RHEL guard, 2025-03-26, openshift#4956).
* 20fe075 (pkg/operator/status: Drop PoolUpdating as an Upgradeable=False condition, 2024-12-16, openshift#5065).

So I'm adding this follow-up commit within openshift#5065 to remove the
'updating' variable that both the kubelet-skew-guard and the
PoolUpdating guard had used, but which we no longer need now that both
are gone in 4.18.