@djoshy
Contributor

@djoshy djoshy commented Sep 10, 2025

This PR implements a Diff() function that ignores boot image values for the AWS, Azure and GCP ProviderConfigs. This allows the MCO's boot image controller to update them without causing a control plane rollout, as described in this comment. I've also added a few unit tests to verify this behavior.

I've manually tested this on AWS, GCP and Azure clusters by setting the boot image to older, valid values and observing that no rollout takes place when the CPMS is set to RollingUpdate. Controller logs indicate no diff, like so:

I0911 13:16:01.118480       1 provider.go:306] "Gathered Machine Info" controller="controlplanemachineset" reconcileID="416ffce6-c3cc-4485-a234-c8fede82bba9" namespace="openshift-machine-api" name="cluster" machineName="ci-ln-1z1b2wb-1d09d-87x75-master-0" nodeName="ci-ln-1z1b2wb-1d09d-87x75-master-0" index=0 ready=true needsUpdate=false diff=null errorMessage=""
I0911 13:16:01.118513       1 provider.go:306] "Gathered Machine Info" controller="controlplanemachineset" reconcileID="416ffce6-c3cc-4485-a234-c8fede82bba9" namespace="openshift-machine-api" name="cluster" machineName="ci-ln-1z1b2wb-1d09d-87x75-master-1" nodeName="ci-ln-1z1b2wb-1d09d-87x75-master-1" index=1 ready=true needsUpdate=false diff=null errorMessage=""
I0911 13:16:01.118525       1 provider.go:306] "Gathered Machine Info" controller="controlplanemachineset" reconcileID="416ffce6-c3cc-4485-a234-c8fede82bba9" namespace="openshift-machine-api" name="cluster" machineName="ci-ln-1z1b2wb-1d09d-87x75-master-2" nodeName="ci-ln-1z1b2wb-1d09d-87x75-master-2" index=2 ready=true needsUpdate=false diff=null errorMessage=""
I0911 13:16:01.118540       1 status.go:119] "Observed Machine Configuration" controller="controlplanemachineset" reconcileID="416ffce6-c3cc-4485-a234-c8fede82bba9" namespace="openshift-machine-api" name="cluster" observedGeneration=2 replicas=3 readyReplicas=3 updatedReplicas=3 unavailableReplicas=0
I0911 13:16:01.118970       1 controller.go:492] "Owner reference already present on machine" controller="controlplanemachineset" reconcileID="416ffce6-c3cc-4485-a234-c8fede82bba9" namespace="openshift-machine-api" name="cluster" machineNamespace="openshift-machine-api" machineName="ci-ln-1z1b2wb-1d09d-87x75-master-1"
I0911 13:16:01.119043       1 controller.go:492] "Owner reference already present on machine" controller="controlplanemachineset" reconcileID="416ffce6-c3cc-4485-a234-c8fede82bba9" namespace="openshift-machine-api" name="cluster" machineNamespace="openshift-machine-api" machineName="ci-ln-1z1b2wb-1d09d-87x75-master-2"
I0911 13:16:01.119102       1 controller.go:492] "Owner reference already present on machine" controller="controlplanemachineset" reconcileID="416ffce6-c3cc-4485-a234-c8fede82bba9" namespace="openshift-machine-api" name="cluster" machineNamespace="openshift-machine-api" machineName="ci-ln-1z1b2wb-1d09d-87x75-master-0"
I0911 13:16:01.119136       1 updates.go:212] "No updates required" controller="controlplanemachineset" reconcileID="416ffce6-c3cc-4485-a234-c8fede82bba9" namespace="openshift-machine-api" name="cluster" updateStrategy="RollingUpdate"
I0911 13:16:01.119833       1 status.go:57] "No update to control plane machine set status required" controller="controlplanemachineset" reconcileID="416ffce6-c3cc-4485-a234-c8fede82bba9" namespace="openshift-machine-api" name="cluster"
I0911 13:16:01.119986       1 controller.go:231] "Finished reconciling control plane machine set" controller="controlplanemachineset" reconcileID="416ffce6-c3cc-4485-a234-c8fede82bba9" namespace="openshift-machine-api" name="cluster"
I0911 13:17:40.000145       1 reflector.go:946] "Watch close" logger="controller-runtime.cache" reflector="sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:114" type="*v1.ClusterOperator" totalItems=9
I0911 13:17:47.900086       1 reflector.go:450] "Forcing resync" logger="controller-runtime.cache" reflector="sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:114"
I0911 13:17:47.900252       1 controller.go:175] "Reconciling control plane machine set" controller="controlplanemachineset" reconcileID="c182739d-3e18-4b3a-a74c-0fb24ee68607" namespace="openshift-machine-api" name="cluster"
I0911 13:17:47.900433       1 controller.go:454] "Finalizer already present on control plane machine set" controller="controlplanemachineset" reconcileID="c182739d-3e18-4b3a-a74c-0fb24ee68607" namespace="openshift-machine-api" name="cluster"
I0911 13:17:47.902340       1 mapping.go:52] "No failure domains provided" controller="controlplanemachineset" reconcileID="c182739d-3e18-4b3a-a74c-0fb24ee68607" namespace="openshift-machine-api" name="cluster"
I0911 13:17:47.955334       1 provider.go:306] "Gathered Machine Info" controller="controlplanemachineset" reconcileID="c182739d-3e18-4b3a-a74c-0fb24ee68607" namespace="openshift-machine-api" name="cluster" machineName="ci-ln-1z1b2wb-1d09d-87x75-master-0" nodeName="ci-ln-1z1b2wb-1d09d-87x75-master-0" index=0 ready=true needsUpdate=false diff=null errorMessage=""
I0911 13:17:47.955374       1 provider.go:306] "Gathered Machine Info" controller="controlplanemachineset" reconcileID="c182739d-3e18-4b3a-a74c-0fb24ee68607" namespace="openshift-machine-api" name="cluster" machineName="ci-ln-1z1b2wb-1d09d-87x75-master-1" nodeName="ci-ln-1z1b2wb-1d09d-87x75-master-1" index=1 ready=true needsUpdate=false diff=null errorMessage=""
I0911 13:17:47.955404       1 provider.go:306] "Gathered Machine Info" controller="controlplanemachineset" reconcileID="c182739d-3e18-4b3a-a74c-0fb24ee68607" namespace="openshift-machine-api" name="cluster" machineName="ci-ln-1z1b2wb-1d09d-87x75-master-2" nodeName="ci-ln-1z1b2wb-1d09d-87x75-master-2" index=2 ready=true needsUpdate=false diff=null errorMessage=""

The vSphere platform already seems to have a diff function that ignores boot image updates, so I did not need to add anything for that case.
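
For readers skimming the thread, the idea behind a boot-image-agnostic diff can be sketched roughly as below. This is not the code from this PR: `awsSpec` and `diffIgnoringBootImage` are hypothetical names, the struct is a trimmed-down stand-in for a real provider spec, and the comparison uses only the standard library rather than the vendored diff package.

```go
package main

import (
	"fmt"
	"reflect"
)

// awsSpec is a hypothetical, trimmed-down stand-in for an AWS provider spec.
type awsSpec struct {
	AMI          string // boot image; ignored when diffing
	InstanceType string
	VolumeSizeGB int
}

// diffIgnoringBootImage lists the fields that differ between base and other,
// skipping the boot image. A nil result means there is no rollout-relevant difference.
func diffIgnoringBootImage(base, other awsSpec) []string {
	// Both arguments are copies, so blanking the AMI does not touch the caller's values.
	base.AMI, other.AMI = "", ""
	if reflect.DeepEqual(base, other) {
		return nil
	}

	var diff []string
	if base.InstanceType != other.InstanceType {
		diff = append(diff, fmt.Sprintf("InstanceType: %s != %s", base.InstanceType, other.InstanceType))
	}
	if base.VolumeSizeGB != other.VolumeSizeGB {
		diff = append(diff, fmt.Sprintf("VolumeSizeGB: %d != %d", base.VolumeSizeGB, other.VolumeSizeGB))
	}
	return diff
}

func main() {
	desired := awsSpec{AMI: "ami-87654321", InstanceType: "m5.large", VolumeSizeGB: 120}
	current := awsSpec{AMI: "ami-12345678", InstanceType: "m5.large", VolumeSizeGB: 120}

	// Only the AMI differs, so no diff is reported.
	fmt.Println(diffIgnoringBootImage(desired, current) == nil) // prints "true"

	// A change to any other field is still reported.
	current.InstanceType = "m5.xlarge"
	fmt.Println(diffIgnoringBootImage(desired, current)) // prints "[InstanceType: m5.large != m5.xlarge]"
}
```

Blanking the ignored field on copies keeps the comparison logic generic: every other field is still compared, so the exception stays limited to the boot image.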

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 10, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2025
@openshift-ci-robot

openshift-ci-robot commented Sep 10, 2025

@djoshy: This pull request references MCO-1866 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

Still a WIP, opened for testing.

This PR implements a Diff() function that ignores boot image values for the AWS, Azure and GCP ProviderConfigs. This allows the MCO's boot image controller to update them without causing a control plane rollout as described in this comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci bot commented Sep 10, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 10, 2025
@djoshy
Contributor Author

djoshy commented Sep 10, 2025

/test all

@djoshy djoshy force-pushed the add-provider-boot-image-exception branch from 70de510 to 90e8619 Compare September 10, 2025 14:03
@djoshy
Contributor Author

djoshy commented Sep 10, 2025

/test unit
/test lint

@djoshy
Contributor Author

djoshy commented Sep 10, 2025

/test unit
/test lint

@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 10, 2025
@djoshy
Contributor Author

djoshy commented Sep 10, 2025

/test vendor
/test verify-deps

@openshift-ci-robot

openshift-ci-robot commented Sep 11, 2025

@djoshy: This pull request references MCO-1866 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

@djoshy djoshy marked this pull request as ready for review September 11, 2025 13:40
@djoshy djoshy changed the title [WIP] MCO-1866: Ignore boot image differences while reconciling Provider Configs MCO-1866: Ignore boot image differences while reconciling Provider Configs Sep 11, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 11, 2025
@openshift-ci openshift-ci bot requested review from damdo and nrb September 11, 2025 13:45
		providerConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithInstanceType("m5.large").Build(),
	},
	compareConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithInstanceType("m5.xlarge").Build(),
	expectedDiff: true,
Member

nit: the boolean here relies on the vendored github.com/go-test/deep to produce output text that is user-accessible. If we're just using that string in logging and such, that's probably fine. If we're exposing the string in user-facing status, we might want to assert a match with a locally-hardcoded expected string, so we hear via failing CI if a future vendor bump changes the output, and can decide how we feel about the change.

Contributor Author

I think this is a fair ask. I'm not sure how user-facing it is, but it does clean up the code a fair bit. Let me know if that push matches your ask!
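
To illustrate the suggestion above, here is a minimal sketch of a table entry that pins the exact diff text instead of a boolean, so that a change in the diff library's output format shows up as a test failure. Everything in it is an assumption for illustration: `diffCase`, `diffFields` and `checkCase` are hypothetical names, and `diffFields` is a standard-library stand-in whose output format merely resembles the diff strings seen in the controller logs.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// diffCase is a hypothetical table entry that pins the exact diff text
// instead of a bare boolean.
type diffCase struct {
	name         string
	base, other  map[string]string
	expectedDiff []string // nil means "no diff expected"
}

// diffFields is a stand-in for a vendored diff library. It only looks at keys
// present in base and emits them in sorted order so the output is deterministic.
func diffFields(base, other map[string]string) []string {
	keys := make([]string, 0, len(base))
	for k := range base {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var diff []string
	for _, k := range keys {
		if base[k] != other[k] {
			diff = append(diff, fmt.Sprintf("%s: %s != %s", k, base[k], other[k]))
		}
	}
	return diff
}

// checkCase returns an error when the produced diff text is not exactly the pinned text.
func checkCase(c diffCase) error {
	got := diffFields(c.base, c.other)
	if !reflect.DeepEqual(got, c.expectedDiff) {
		return fmt.Errorf("%s: got diff %q, want %q", c.name, got, c.expectedDiff)
	}
	return nil
}

func main() {
	cases := []diffCase{
		{
			name:         "with different instance types",
			base:         map[string]string{"InstanceType": "m5.large"},
			other:        map[string]string{"InstanceType": "m5.xlarge"},
			expectedDiff: []string{"InstanceType: m5.large != m5.xlarge"},
		},
		{
			name:  "with identical configs",
			base:  map[string]string{"InstanceType": "m5.large"},
			other: map[string]string{"InstanceType": "m5.large"},
		},
	}
	for _, c := range cases {
		if err := checkCase(c); err != nil {
			fmt.Println("FAIL:", err)
			continue
		}
		fmt.Println("ok:", c.name)
	}
}
```

The trade-off is that pinned strings need updating whenever the format changes on purpose, which is exactly the review signal being asked for.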

@djoshy djoshy force-pushed the add-provider-boot-image-exception branch from 8372c75 to 29f4d01 Compare September 12, 2025 15:51
Member

@damdo damdo left a comment

Thanks, the implementation looks good! 🎉
Just a bunch of nits on the tests, but otherwise we are good to go.

Comment on lines 273 to 269
Entry("with different AWS AMIs", diffTableInput{
	baseConfig: AWSProviderConfig{
		providerConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithAMI(machinev1beta1.AWSResourceReference{ID: stringPtr("ami-12345678")}).Build(),
	},
	compareConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithAMI(machinev1beta1.AWSResourceReference{ID: stringPtr("ami-87654321")}).Build(),
	expectedDiff: false,
}),
Member

Let's add a comment on this test case explaining why the diff is false even though the two configs are different.

Contributor Author

Ahh, just saw these 😅 Will fix and push again. Thanks for the review!

Contributor Author

Updated, let me know if that is sufficient. I'm happy to update it if it needs to be more verbose 😄

Comment on lines 145 to 141
Entry("with different images", diffTableInput{
	baseConfig: AzureProviderConfig{
		providerConfig: *machinev1beta1resourcebuilder.AzureProviderSpec().WithImage(machinev1beta1.Image{Publisher: "RedHat", Offer: "RHEL", SKU: "8-LVM", Version: "8.4.2021040911"}).Build(),
	},
	compareConfig: *machinev1beta1resourcebuilder.AzureProviderSpec().WithImage(machinev1beta1.Image{Publisher: "RedHat", Offer: "RHEL", SKU: "8-LVM", Version: "8.5.2021111016"}).Build(),
	expectedDiff: false,
}),
Member

Same

Comment on lines 137 to 149
Entry("with different GCP disks.images values", diffTableInput{
	baseConfig: GCPProviderConfig{
		providerConfig: *machinev1beta1resourcebuilder.GCPProviderSpec().WithDisks([]*machinev1beta1.GCPDisk{
			{
				AutoDelete: true,
				Boot: true,
				SizeGB: 100,
				Type: "pd-standard",
				Image: "projects/rhcos-cloud/global/images/rhcos-416-92-202301311551-0-gcp-x86-64",
			},
		}).Build(),
	},
	compareConfig: *machinev1beta1resourcebuilder.GCPProviderSpec().WithDisks([]*machinev1beta1.GCPDisk{
		{
			AutoDelete: true,
			Boot: true,
			SizeGB: 100,
			Type: "pd-standard",
			Image: "projects/rhcos-cloud/global/images/rhcos-417-92-202302090245-0-gcp-x86-64",
		},
	}).Build(),
	expectedDiff: false,
}),
Member

Same
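
As an aside on the GCP case: because the disks are held as a slice of pointers, clearing the image in place would also modify the caller's config. A hedged sketch of one way to avoid that is shown below. It is not taken from this PR; `gcpDisk`, `withoutImages` and `disksEqualIgnoringImage` are hypothetical names, and the struct only mirrors the fields used in the test entry above.

```go
package main

import (
	"fmt"
	"reflect"
)

// gcpDisk is a hypothetical stand-in mirroring the disk fields used in the test entry.
type gcpDisk struct {
	AutoDelete bool
	Boot       bool
	SizeGB     int64
	Type       string
	Image      string // boot image; ignored when comparing
}

// withoutImages returns a deep copy of disks with every Image cleared,
// leaving the disks passed in by the caller untouched.
func withoutImages(disks []*gcpDisk) []*gcpDisk {
	out := make([]*gcpDisk, len(disks))
	for i, d := range disks {
		if d == nil {
			continue
		}
		c := *d // copy the struct so the original pointer target is not modified
		c.Image = ""
		out[i] = &c
	}
	return out
}

// disksEqualIgnoringImage compares two disk lists while ignoring the image field.
func disksEqualIgnoringImage(a, b []*gcpDisk) bool {
	return reflect.DeepEqual(withoutImages(a), withoutImages(b))
}

func main() {
	base := []*gcpDisk{{AutoDelete: true, Boot: true, SizeGB: 100, Type: "pd-standard", Image: "image-a"}}
	other := []*gcpDisk{{AutoDelete: true, Boot: true, SizeGB: 100, Type: "pd-standard", Image: "image-b"}}

	fmt.Println(disksEqualIgnoringImage(base, other)) // images differ only: prints "true"
	fmt.Println(base[0].Image)                        // original is untouched: prints "image-a"

	other[0].SizeGB = 200
	fmt.Println(disksEqualIgnoringImage(base, other)) // size differs: prints "false"
}
```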

Comment on lines 1047 to 1056
Entry("with different AWS AMIs", diffTableInput{
	basePC: &providerConfig{
		platformType: configv1.AWSPlatformType,
		aws: AWSProviderConfig{
			providerConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithAMI(machinev1beta1.AWSResourceReference{ID: stringPtr("ami-12345678")}).Build(),
		},
	},
	comparePC: &providerConfig{
		platformType: configv1.AWSPlatformType,
		aws: AWSProviderConfig{
			providerConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithAMI(machinev1beta1.AWSResourceReference{ID: stringPtr("ami-87654321")}).Build(),
		},
	},
	expectedDiff: false,
}),
Member

Same for these

@djoshy djoshy force-pushed the add-provider-boot-image-exception branch from 29f4d01 to 263f40c Compare September 12, 2025 17:02
@damdo
Member

damdo commented Sep 15, 2025

/retest

@damdo
Member

damdo commented Sep 15, 2025

/test e2e-openstack-operator e2e-nutanix-ovn

@djoshy
Contributor Author

djoshy commented Sep 16, 2025

This really shouldn't be affecting OpenStack deployments, strange!

/test e2e-openstack-operator

ci/prow/e2e-aws-ovn-etcd-scaling history looks fairly red, so perhaps not related to this PR either?

@damdo
Member

damdo commented Sep 16, 2025

ci/prow/e2e-aws-ovn-etcd-scaling history looks fairly red, so perhaps not related to this PR either?

Correct, the scaling jobs have been permafailing for a while now. Don't worry about those, we'll override them.

Member

@damdo damdo left a comment

Looks good to me!

Thanks a lot for this.

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 16, 2025
@openshift-ci
Contributor

openshift-ci bot commented Sep 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damdo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 16, 2025
@damdo
Member

damdo commented Sep 16, 2025

/override ci/prow/e2e-aws-ovn-etcd-scaling ci/prow/e2e-azure-ovn-etcd-scaling ci/prow/e2e-gcp-ovn-etcd-scaling ci/prow/e2e-vsphere-ovn-etcd-scaling

These have been permafailing for a while now, so failure is not related to this PR.

@JoelSpeed
Contributor

CC @jeana-redhat we will want to make sure that this change is documented

@jeana-redhat

CC @jeana-redhat we will want to make sure that this change is documented

ACK and TY - which versions? I don't see one set on either https://issues.redhat.com//browse/MCO-1866 or its parent https://issues.redhat.com/browse/MCO-1007

@openshift-ci
Contributor

openshift-ci bot commented Sep 16, 2025

@damdo: Overrode contexts on behalf of damdo: ci/prow/e2e-aws-ovn-etcd-scaling, ci/prow/e2e-azure-ovn-etcd-scaling, ci/prow/e2e-gcp-ovn-etcd-scaling, ci/prow/e2e-vsphere-ovn-etcd-scaling

@djoshy
Contributor Author

djoshy commented Sep 16, 2025

ACK and TY - which versions? I don't see one set on either https://issues.redhat.com//browse/MCO-1866 or its parent https://issues.redhat.com/browse/MCO-1007

Ah sorry about that - this will only affect 4.21+.

Q: With respect to the verified tag, while I've manually tested this, I'm not sure if that is good enough to qualify as pre-merge testing as I'm definitely not a QE expert 😄 Would someone from cloud QE be able to pre merge test this? I can check with MCO QE as well, but they're quite backed up at the moment.

@openshift-ci
Contributor

openshift-ci bot commented Sep 16, 2025

@djoshy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-openstack-operator 263f40c link false /test e2e-openstack-operator

@damdo
Member

damdo commented Sep 16, 2025

ACK and TY - which versions? I don't see one set on either https://issues.redhat.com//browse/MCO-1866 or its parent https://issues.redhat.com/browse/MCO-1007

Ah sorry about that - this will only affect 4.21+.

Q: With respect to the verified tag, while I've manually tested this, I'm not sure if that is good enough to qualify as pre-merge testing as I'm definitely not a QE expert 😄 Would someone from cloud QE be able to pre merge test this? I can check with MCO QE as well, but they're quite backed up at the moment.

@djoshy I'll ping our QE folks about this, thanks!

@huali9

huali9 commented Sep 17, 2025

Not sure if I understand the feature correctly. I tried this feature on AWS today and have some questions:

  1. I updated the CPMS, changing the AMI ID from ami-082a55a580d5538ed to ami-0d4a7b7677c0c883f. The CPMS now holds the new value ami-0d4a7b7677c0c883f, no update was triggered, and there is no diff info in the CPMS logs.

I am not sure whether the CPMS AMI ID should be reverted to the original one, because for a worker machineset, if I change the AMI ID, it is reverted to the original value.

  2. I then updated the CPMS again, changing volumeSize from 120 to 130. This triggered an update, and the CPMS logs show diff=["BlockDevices.slice[0].EBS.VolumeSize: 130 != 120"]. The new master was created with the new AMI (ami-0d4a7b7677c0c883f) and the new volumeSize (130).

I am not sure whether the new master is expected to use the new AMI or the old one, because the diff info (which mentions only volumeSize) and the actual update behavior (both the AMI ID and volumeSize are updated) seem inconsistent.

@djoshy
Contributor Author

djoshy commented Sep 17, 2025

I am not sure whether the CPMS AMI ID should be reverted to the original one, because for a worker machineset, if I change the AMI ID, it is reverted to the original value.

In worker machinesets, boot image updates are enabled by default on AWS. As a result, any AMI value you set manually is stomped back to the boot image defined by the cluster's OCP release. We have not yet implemented boot image updates for the CPMS, so the behavior you describe is expected.

I am not sure whether the new master is expected to use the new AMI or the old one, because the diff info (which mentions only volumeSize) and the actual update behavior (both the AMI ID and volumeSize are updated) seem inconsistent.

The goal of this PR is to prevent rollouts of the control plane when the MCO performs boot image updates (this comment explains what would happen if we didn't). The diff listed in the logs, to my understanding, is only used to check whether an existing control plane machine needs to be replaced. For control plane machines (and nodes) that already exist, boot image updates (i.e. AMI values on AWS) are irrelevant, as they would have already pivoted to the RHCOS designated by the OCP release image. Boot images only affect nodes that are scaled up in the future, so what you described is expected behavior: any new machine spawned (whether due to a diff in another field or a manual scale-up by the admin) should follow the AMI defined in the CPMS spec.
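
To make that behavior concrete, here is a rough sketch of the two steps from the test above, under simplifying assumptions: `spec`, `needsReplacement` and `reconcile` are hypothetical names, the spec has only two fields, and every outdated machine is replaced in a single pass rather than one at a time as a real rolling update would.

```go
package main

import "fmt"

// spec is a hypothetical two-field stand-in for a control plane machine's provider spec.
type spec struct {
	AMI          string
	VolumeSizeGB int
}

// needsReplacement decides whether an existing machine must be replaced.
// The AMI is deliberately left out of the comparison.
func needsReplacement(desired, machine spec) bool {
	return desired.VolumeSizeGB != machine.VolumeSizeGB
}

// reconcile returns the machine specs after one simplified pass. A replacement
// machine is built from the full desired spec, so it picks up the desired AMI
// even though the AMI itself never triggers a replacement.
func reconcile(desired spec, machines []spec) []spec {
	out := make([]spec, len(machines))
	for i, m := range machines {
		if needsReplacement(desired, m) {
			out[i] = desired
		} else {
			out[i] = m
		}
	}
	return out
}

func main() {
	machines := []spec{
		{AMI: "ami-old", VolumeSizeGB: 120},
		{AMI: "ami-old", VolumeSizeGB: 120},
		{AMI: "ami-old", VolumeSizeGB: 120},
	}

	// Step 1: only the AMI changes in the desired spec; existing machines are kept.
	desired := spec{AMI: "ami-new", VolumeSizeGB: 120}
	machines = reconcile(desired, machines)
	fmt.Println(machines[0].AMI) // prints "ami-old"

	// Step 2: the volume size changes; replacements use the whole desired spec.
	desired.VolumeSizeGB = 130
	machines = reconcile(desired, machines)
	fmt.Println(machines[0].AMI, machines[0].VolumeSizeGB) // prints "ami-new 130"
}
```

This is why the logged diff mentions only volumeSize while the new master still comes up with the new AMI: the diff answers "does this machine need replacing?", and the CPMS spec answers "what should a new machine look like?".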

@huali9

huali9 commented Sep 18, 2025

Thank you @djoshy for the detailed explanation, good to know. Let me continue testing on other platforms (Azure, GCP) and report the results later.

@huali9

huali9 commented Sep 18, 2025

/verified by @huali9
Pre-merge tested on AWS, GCP and Azure.
Updating only the boot image (ami for AWS, disks[0].image for GCP, image for Azure) in the CPMS does not trigger an update; the new value is stored in the CPMS.
Updating only other fields triggers an update as before.
Updating both the boot image and other fields triggers an update; new masters are created with the new boot image and the new values of the other fields.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Sep 18, 2025
@openshift-ci-robot

@huali9: This PR has been marked as verified by @huali9.
