-
Notifications
You must be signed in to change notification settings - Fork 462
MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy API #4267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy API #4267
Conversation
|
Skipping CI for Draft Pull Request. |
7afa70a to
1af0015
Compare
|
@djoshy: This pull request references MCO-1009 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@djoshy: This pull request references MCO-1009 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
5d9737d to
5fa5f95
Compare
25cf96d to
ee568f2
Compare
ee568f2 to
7f3d08b
Compare
|
@djoshy: This pull request references MCO-1009 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
e415e94 to
4ec0517
Compare
|
/test unit |
8bbfba2 to
f3d90b0
Compare
|
@djoshy: This pull request references MCO-1009 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
This enables the Operator to read in user defined node disruption polices, merge it with cluster defaults and then update the status on the MachineConfiguration object. This is only done when FeatureGateNodeDisruptionPolicy is enabled.
This change makes the daemon read in NodeDisruptionPolicyStatus during an update and queue up the appropriate actions that the policy requests. No deduping is being done for the policies currently. This will not take place during firstboot as the featureGateAccessor does not exist then.
f3d90b0 to
103bdc1
Compare
|
Made some additional fixes related to the transition to TechPreview featureset. In my tests, I noticed that operator would not get far enough to actually render the
With the above fixes, the only variable to account for is the operator's lease acquire time, after which the status will be populated, with which the daemons can orchestrate updates. The two minute timeout is largely arbitrary. During the "feature gate" transition update, some nodes may have to fallback to the legacy path in case the operator is not fast enough to populate the Note: This is only applicable for cases where the feature gate was applied after the cluster was installed with a "Default" featureset. If the cluster featureset was TechPreview at install time, this issue would not happen, it is the transition that gunked up the process. |
|
@djoshy: This pull request references MCO-1009 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| for _, policyFile := range clusterPolicies.Files { | ||
| klog.V(4).Infof("comparing policy path %s to diff path %s", policyFile.Path, diffPath) | ||
| if policyFile.Path == diffPath { | ||
| klog.Infof("NodeDisruptionPolicy found for diff file %s!", diffPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Perhaps we should exclude ! in the end since it can be confusing that it is part of diffPath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack, will fix alongside anything QE might find on next pass. Thanks!
| // If at any point an error occurs, we reboot the node so that node has correct configuration. | ||
| func (dn *Daemon) performPostConfigChangeNodeDisruptionAction(postConfigChangeActions []opv1.NodeDisruptionPolicyStatusAction, configName string) error { | ||
|
|
||
| logSystem("Executing performPostConfigChangeNodeDisruptionAction(drain already complete/skipped) for config %s", configName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can just say "Performing post config change action".
We can skip drain already complete/skipped since this is already captured in https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/update.go#L793 if we drain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack, yeah can clean this up. The reason I kept that marker is this loop will also roll through the Drain action, but I can put in a no-op condition message for that (:
sinnykumari
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor nits which are optional.
Other than that this looks great. Nice work David!
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: djoshy, sinnykumari, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
pre-merge testing result:
Issues:
Test cases are linked to epic MCO-507 /label qe-approved |
|
@djoshy: This pull request references MCO-1009 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest-required |
|
@djoshy: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/override ci/prow/e2e-hypershift Hypershift failures are unrelated and being fixed. Since the PR has passed it in the past, I'm overriding it. |
|
Looks like openshift/hypershift#3938 doesn't help much with HyperShift failure. Looking at recent failure on this PR doesn't look like MCO specific. Also, we had a successful run on this PR with current changes in https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4267/pull-ci-openshift-machine-config-operator-master-e2e-hypershift/1782558388421398528 . This PR should be safe to go in. overriding e2e-hypershift test to get this feature work merged. |
|
@sinnykumari: Overrode contexts on behalf of sinnykumari: ci/prow/e2e-hypershift DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@djoshy: Overrode contexts on behalf of djoshy: ci/prow/e2e-hypershift DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
[ART PR BUILD NOTIFIER] This PR has been included in build ose-machine-config-operator-container-v4.16.0-202404251110.p0.g9b99954.assembly.stream.el9 for distgit ose-machine-config-operator. |
This PR does the following:
How to test:
In this case, the SSHkey section has been overriden.
test.serviceunit. Observe the daemon logs on the targeted node, and you should see the daemon step through the designated actions.Some things to note
/etc/containers/registries.d/to account for some recent changes from OCPNODE-1632: Support ClusterImagePolicy CRD #4160 (comment). In the future, the MCO plans to add directories and wildcard support. With that in place, this exception can be removed and made part of the standard policy.