MCO-1866: Ignore boot image differences while reconciling Provider Configs #368
Conversation
@djoshy: This pull request references MCO-1866 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
/test all
Force-pushed from 70de510 to 90e8619.
/test unit
Force-pushed from 90e8619 to 8372c75.
/test unit
/test vendor
	providerConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithInstanceType("m5.large").Build(),
},
compareConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithInstanceType("m5.xlarge").Build(),
expectedDiff: true,
nit: the boolean here relies on the vendored github.com/go-test/deep to produce output text that is user-accessible. If we're just using that string in logging and such, that's probably fine. If we're exposing the string in user-facing status, we might want to assert a match with a locally-hardcoded expected string, so we hear via failing CI if a future vendor bump changes the output, and can decide how we feel about the change.
I think this is a fair ask. I'm not sure how user-facing it is, but it does clean up the code a fair bit. Let me know if that push matches your ask!
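To make the suggestion concrete, here is a minimal sketch of what asserting a locally hardcoded diff could look like for the instance-type entry above. The expectedDiffStrings field name and the exact go-test/deep output format are my assumptions, so the actual test in this push may differ.

```go
// Sketch only: assert the exact diff text so a future vendor bump of
// go-test/deep that changes its output fails CI instead of silently
// altering user-visible strings.
Entry("with different instance types", diffTableInput{
	baseConfig: AWSProviderConfig{
		providerConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithInstanceType("m5.large").Build(),
	},
	compareConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithInstanceType("m5.xlarge").Build(),
	// Hypothetical field: a hardcoded expectation of deep.Equal's output.
	expectedDiffStrings: []string{"InstanceType: m5.large != m5.xlarge"},
}),
```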
Force-pushed from 8372c75 to 29f4d01.
damdo left a comment:
Thanks, the implementation looks good! 🎉
Just a bunch of nits on the tests, but otherwise we are good to go.
| Entry("with different AWS AMIs", diffTableInput{ | ||
| baseConfig: AWSProviderConfig{ | ||
| providerConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithAMI(machinev1beta1.AWSResourceReference{ID: stringPtr("ami-12345678")}).Build(), | ||
| }, | ||
| compareConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithAMI(machinev1beta1.AWSResourceReference{ID: stringPtr("ami-87654321")}).Build(), | ||
| expectedDiff: false, | ||
| }), |
Let's add a comment on this test case explaining why the diff is false even if they are different
Ahh, just saw these 😅 Will fix and push again. Thanks for the review!
Updated, let me know if that is sufficient. I'm happy to update it if it needs to be more verbose 😄
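For readers following along, the requested comment could read roughly like the sketch below; the wording is illustrative and may not match what was actually pushed.

```go
// The AMI is the AWS boot image, which the MCO's boot image controller
// reconciles on its own. Diff() deliberately ignores it, so two specs that
// differ only in their AMI report no diff and trigger no control plane rollout.
Entry("with different AWS AMIs", diffTableInput{
	baseConfig: AWSProviderConfig{
		providerConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithAMI(machinev1beta1.AWSResourceReference{ID: stringPtr("ami-12345678")}).Build(),
	},
	compareConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithAMI(machinev1beta1.AWSResourceReference{ID: stringPtr("ami-87654321")}).Build(),
	expectedDiff: false,
}),
```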
| Entry("with different images", diffTableInput{ | ||
| baseConfig: AzureProviderConfig{ | ||
| providerConfig: *machinev1beta1resourcebuilder.AzureProviderSpec().WithImage(machinev1beta1.Image{Publisher: "RedHat", Offer: "RHEL", SKU: "8-LVM", Version: "8.4.2021040911"}).Build(), | ||
| }, | ||
| compareConfig: *machinev1beta1resourcebuilder.AzureProviderSpec().WithImage(machinev1beta1.Image{Publisher: "RedHat", Offer: "RHEL", SKU: "8-LVM", Version: "8.5.2021111016"}).Build(), | ||
| expectedDiff: false, | ||
| }), |
Same
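For the Azure entry above, a minimal sketch of the kind of comparison that would make this test pass, assuming the implementation clears the boot image on copies of both specs before handing them to github.com/go-test/deep. The type, field, and method names mirror the test code here but are otherwise assumptions, not the exact merged code.

```go
package providerconfig

import (
	"github.com/go-test/deep"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
)

// AzureProviderConfig wraps an Azure provider spec, mirroring the test input above.
type AzureProviderConfig struct {
	providerConfig machinev1beta1.AzureMachineProviderSpec
}

// diffIgnoringImage compares two Azure provider specs with the boot image
// (the Publisher/Offer/SKU/Version Image block) zeroed out on both copies,
// so image-only differences produce an empty diff.
func (a AzureProviderConfig) diffIgnoringImage(other machinev1beta1.AzureMachineProviderSpec) []string {
	base := a.providerConfig
	base.Image = machinev1beta1.Image{}
	other.Image = machinev1beta1.Image{}

	return deep.Equal(base, other)
}
```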
| Entry("with different GCP disks.images values", diffTableInput{ | ||
| baseConfig: GCPProviderConfig{ | ||
| providerConfig: *machinev1beta1resourcebuilder.GCPProviderSpec().WithDisks([]*machinev1beta1.GCPDisk{ | ||
| { | ||
| AutoDelete: true, | ||
| Boot: true, | ||
| SizeGB: 100, | ||
| Type: "pd-standard", | ||
| Image: "projects/rhcos-cloud/global/images/rhcos-416-92-202301311551-0-gcp-x86-64", | ||
| }, | ||
| }).Build(), | ||
| }, | ||
| compareConfig: *machinev1beta1resourcebuilder.GCPProviderSpec().WithDisks([]*machinev1beta1.GCPDisk{ | ||
| { | ||
| AutoDelete: true, | ||
| Boot: true, | ||
| SizeGB: 100, | ||
| Type: "pd-standard", | ||
| Image: "projects/rhcos-cloud/global/images/rhcos-417-92-202302090245-0-gcp-x86-64", | ||
| }, | ||
| }).Build(), | ||
| expectedDiff: false, | ||
| }), |
Same
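Similarly for GCP, where the boot image lives on each disk entry, here is a sketch (same imports and caveats as the Azure sketch above) that clears every disk's Image field on copies of both specs before comparing:

```go
// diffIgnoringDiskImages compares two GCP provider specs while ignoring the
// Image field of every disk, so boot-image-only differences report no diff.
// Illustrative only; the merged code may be structured differently.
func diffIgnoringDiskImages(base, compare machinev1beta1.GCPMachineProviderSpec) []string {
	clearImages := func(spec machinev1beta1.GCPMachineProviderSpec) machinev1beta1.GCPMachineProviderSpec {
		out := spec
		out.Disks = make([]*machinev1beta1.GCPDisk, 0, len(spec.Disks))

		for _, disk := range spec.Disks {
			if disk == nil {
				out.Disks = append(out.Disks, nil)
				continue
			}

			d := *disk // copy so the caller's disk is not mutated
			d.Image = ""
			out.Disks = append(out.Disks, &d)
		}

		return out
	}

	return deep.Equal(clearImages(base), clearImages(compare))
}
```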
| Entry("with different AWS AMIs", diffTableInput{ | ||
| basePC: &providerConfig{ | ||
| platformType: configv1.AWSPlatformType, | ||
| aws: AWSProviderConfig{ | ||
| providerConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithAMI(machinev1beta1.AWSResourceReference{ID: stringPtr("ami-12345678")}).Build(), | ||
| }, | ||
| }, | ||
| comparePC: &providerConfig{ | ||
| platformType: configv1.AWSPlatformType, | ||
| aws: AWSProviderConfig{ | ||
| providerConfig: *machinev1beta1resourcebuilder.AWSProviderSpec().WithAMI(machinev1beta1.AWSResourceReference{ID: stringPtr("ami-87654321")}).Build(), | ||
| }, | ||
| }, | ||
| expectedDiff: false, | ||
| }), |
Same for these
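At the platform-agnostic level exercised by these entries, the dispatch could look something like the sketch below. It builds on the per-platform helpers sketched elsewhere in this conversation (the AWS one appears further down), assumes fmt and configv1 (github.com/openshift/api/config/v1) are imported alongside the packages above, and uses placeholder field and method names rather than the operator's actual identifiers.

```go
// diffIgnoringBootImages dispatches to a per-platform comparison that has the
// relevant boot image field cleared. Platforms not listed here keep whatever
// comparison they already had.
func (p providerConfig) diffIgnoringBootImages(other providerConfig) ([]string, error) {
	if p.platformType != other.platformType {
		return nil, fmt.Errorf("cannot compare provider configs for mismatched platforms %q and %q", p.platformType, other.platformType)
	}

	switch p.platformType {
	case configv1.AWSPlatformType:
		return p.aws.diffIgnoringAMI(other.aws.providerConfig), nil
	case configv1.AzurePlatformType:
		return p.azure.diffIgnoringImage(other.azure.providerConfig), nil
	case configv1.GCPPlatformType:
		return diffIgnoringDiskImages(p.gcp.providerConfig, other.gcp.providerConfig), nil
	default:
		return nil, fmt.Errorf("unsupported platform type %q in this sketch", p.platformType)
	}
}
```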
Force-pushed from 29f4d01 to 263f40c.
/retest
/test e2e-openstack-operator e2e-nutanix-ovn
This really shouldn't be affecting openstack deployments, strange!
/test e2e-openstack-operator
The ci/prow/e2e-aws-ovn-etcd-scaling history looks fairly red, so perhaps not related to this PR either?
Correct, the scaling jobs have been permafailing for a while now. Don't worry about those, we'll override them.
damdo left a comment:
Looks good to me!
Thanks a lot for this.
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: damdo
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/override ci/prow/e2e-aws-ovn-etcd-scaling ci/prow/e2e-azure-ovn-etcd-scaling ci/prow/e2e-gcp-ovn-etcd-scaling ci/prow/e2e-vsphere-ovn-etcd-scaling
These have been permafailing for a while now, so the failures are not related to this PR.
CC @jeana-redhat, we will want to make sure that this change is documented.
ACK and TY - which versions? I don't see one set on either https://issues.redhat.com//browse/MCO-1866 or its parent https://issues.redhat.com/browse/MCO-1007
@damdo: Overrode contexts on behalf of damdo: ci/prow/e2e-aws-ovn-etcd-scaling, ci/prow/e2e-azure-ovn-etcd-scaling, ci/prow/e2e-gcp-ovn-etcd-scaling, ci/prow/e2e-vsphere-ovn-etcd-scaling
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Ah, sorry about that - this will only affect 4.21+. Q: With respect to the …
@djoshy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@djoshy I'll ping our QE folks about this, thanks!
Not sure if I understand the feature correctly. I tried this feature on AWS today and have some questions:
I am not sure if the CPMS …
I am not sure if the new master is expected to be the new …
In worker machinesets, boot image updates are enabled by default on AWS. As a result, any AMI value you manually set will be stomped back to the boot image defined by the OCP release of the cluster. We have not yet implemented boot image updates for CPMS, so the behavior you describe is expected.
The goal of this PR is to prevent rollouts of the control plane when the MCO performs boot image updates (this comment explains what would happen if we didn't). The diff listed in the logs, to my understanding, is only used to check whether a replacement of an existing control plane machine is required. For control plane machines (and nodes) that already exist, boot image updates (i.e. AMI values on AWS) are irrelevant, as they would have already pivoted to the RHCOS version designated by the OCP release image. Boot images only affect nodes that are scaled up in the future, so what you described is expected behavior; any new machines spawned (whether due to a diff of another field or manual scaling by the admin) should follow the AMI defined by the CPMS spec.
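To make the "ignore the boot image in the diff" part concrete, here is a minimal sketch of the AWS-side comparison, under the same imports and caveats as the Azure and GCP sketches earlier in the thread. It assumes AWSProviderConfig wraps a machinev1beta1.AWSMachineProviderConfig in a providerConfig field, as the test entries suggest; the merged code may differ in names and structure.

```go
// diffIgnoringAMI compares two AWS provider specs with the AMI (the AWS boot
// image) zeroed out on copies of both sides, so an MCO boot image update on
// its own never registers as a control plane change.
func (a AWSProviderConfig) diffIgnoringAMI(other machinev1beta1.AWSMachineProviderConfig) []string {
	base := a.providerConfig
	base.AMI = machinev1beta1.AWSResourceReference{}
	other.AMI = machinev1beta1.AWSResourceReference{}

	return deep.Equal(base, other)
}
```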
Thank you @djoshy for the detailed explanation, good to know. Let me continue testing on other platforms (Azure, GCP) and report the results later.
/verified by @huali9
@huali9: This PR has been marked as verified by @huali9.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Merged f75055b into openshift:main.
This PR implements a Diff() function that ignores boot image values for the AWS, Azure, and GCP ProviderConfigs. This allows the MCO's boot image controller to update them without causing a control plane rollout, as described in this comment. I've also added a few unit tests to verify this behavior.
I've manually tested this on AWS, GCP, and Azure clusters by setting the boot image to older, valid values and observing that no rollout takes place when the CPMS is set to RollingUpdate. Controller logs indicate no diff.
The vsphere platform already seems to have a diff function that ignores boot image updates, so I did not need to add anything for that case.
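For readers wondering how this plugs into the rollout decision, here is a hypothetical caller-side sketch; the interface and function names are invented for illustration and are not the operator's real call sites.

```go
package example

import "fmt"

// ProviderDiffer is a stand-in for the provider config comparison described
// in this PR: Diff returns the list of differences, with boot image fields
// already excluded from the comparison.
type ProviderDiffer interface {
	Diff(other ProviderDiffer) ([]string, error)
}

// needsReplacement reports whether an existing control plane machine should
// be replaced. Because boot image fields never appear in the diff, an MCO
// boot image update alone yields an empty diff and therefore no rollout.
func needsReplacement(desired, current ProviderDiffer) (bool, error) {
	diff, err := desired.Diff(current)
	if err != nil {
		return false, fmt.Errorf("could not compare provider configs: %w", err)
	}

	return len(diff) > 0, nil
}
```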