Skip drain on Single Node deployment #2457
Conversation
With the introduction of different ControlPlaneTopology types in OpenShift clusters, cluster behaviour may differ based on cluster type. For example, on a cluster with a single controlPlane node it doesn't make sense to perform a workload drain. ControllerConfig now understands the ControlPlaneTopology (a TopologyMode) set in the cluster. The node controller later reads the value from controllerConfig and sets it in the node annotation `machineconfiguration.openshift.io/controlPlaneTopology`. While performing a configuration update, machine-config-daemon reads the annotation and decides the drain action based on the controlPlaneTopology type: MCD skips the drain if controlPlaneTopology is SingleReplica, and defaults to draining in all other cases.
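To make the decision concrete, here is a minimal, self-contained sketch in Go of the drain choice described above. The `shouldDrain` helper and the hard-coded constants are illustrative assumptions, not the actual MCD code:

```go
package main

import "fmt"

// Annotation key and topology value taken from the PR description.
const (
	controlPlaneTopologyAnnotation = "machineconfiguration.openshift.io/controlPlaneTopology"
	singleReplicaTopology          = "SingleReplica"
)

// shouldDrain encodes the rule described above: skip the drain only when
// controlPlaneTopology is SingleReplica; in any other case (including a
// missing annotation) default to performing the drain.
func shouldDrain(nodeAnnotations map[string]string) bool {
	return nodeAnnotations[controlPlaneTopologyAnnotation] != singleReplicaTopology
}

func main() {
	sno := map[string]string{controlPlaneTopologyAnnotation: "SingleReplica"}
	ha := map[string]string{controlPlaneTopologyAnnotation: "HighlyAvailable"}
	fmt.Println(shouldDrain(sno)) // false: single-node cluster, skip drain
	fmt.Println(shouldDrain(ha))  // true: drain as usual
}
```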
/test e2e-aws-single-node
pkg/daemon/daemon.go (outdated)
Since you use the dn.Node reference instead, the linter is complaining that the node variable isn't used anymore here
yeah, fixed it
Refactored the drain logic and moved drain-related functions into drain.go for easier maintenance.
Updated existing tests and added a new test where the node controller reads the ControlPlaneTopology value from controllerConfig and compares it with the `machineconfiguration.openshift.io/controlPlaneTopology` annotation value set on the node.
unit test seems to be failing due to an unrelated test
/test e2e-aws-single-node
/retest
yuqi-zhang
left a comment
Overall looks good! Just some minor questions/nits.
Also, let's say a user has a default 3-3 cluster. If they set that controllerconfig spec, would they be able to skip drains on their updates? Or does something overwrite that?
One last question: I wonder if ensureControllerConfigSpec function needs to be updated or not
    like the web console to tell users where to find the Kubernetes
    API.
  type: string
controlPlaneTopology:
where is controlPlaneTopology coming from? I see in the commit message:
ControllerConfig now understands ControlPlaneTopology:TopologyMode set in the cluster.
But I don't see how we're syncing that into the controlPlaneTopology field in the controllerconfig, maybe I'm missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syncMachineConfigController() calls resourceapply.ApplyControllerConfig, which calls resourcemerge.EnsureControllerConfig, which in turn calls ensureControllerConfigSpec(); there we check whether the Infra object has been modified and update the controllerconfig if it changed. The Infra object is where the controlPlaneTopology value lives along with other data: https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L86 . MCO was already reading and updating the Infra content, so adding the controlPlaneTopology field to the CRD populates its value as well.
And the operator reads the infrastructure object from the cluster at https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L220 and syncs the controllerConfigSpec. There are too many things to follow up...
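To summarize that chain, here's a hedged sketch of the merge step with stand-in types (the real ensureControllerConfigSpec in resourcemerge handles many more fields than shown here):

```go
package main

import (
	"fmt"
	"reflect"
)

// Illustrative stand-ins for the real openshift/api types.
type Infrastructure struct {
	ControlPlaneTopology string
}

type ControllerConfigSpec struct {
	Infra Infrastructure
}

// ensureSpec mirrors the pattern described above: compare the Infra object
// read from the cluster with what the ControllerConfig currently stores,
// and update the spec (reporting a modification) only when they differ.
func ensureSpec(existing *ControllerConfigSpec, required Infrastructure) bool {
	if reflect.DeepEqual(existing.Infra, required) {
		return false
	}
	existing.Infra = required
	return true
}

func main() {
	spec := &ControllerConfigSpec{}
	modified := ensureSpec(spec, Infrastructure{ControlPlaneTopology: "SingleReplica"})
	fmt.Println(modified, spec.Infra.ControlPlaneTopology) // true SingleReplica
}
```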
Ahh, ok thanks for the explanation! Right it gets synced as part of the infra field. Not the clearest of code paths 🤦
No, the MCO operator pod keeps resyncing the value of controllerConfigSpec, and we read the data from the infrastructure cluster object, so in the next sync it will revert the value to whatever is set cluster-wide.
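As a toy illustration of why a hand-edited value doesn't stick (all names here are hypothetical; the real operator syncs far more than one field):

```go
package main

import "fmt"

// infraTopology is the cluster-wide source of truth (the Infrastructure
// object); ccTopology is the value stored in the ControllerConfig spec.
type clusterState struct {
	infraTopology string
	ccTopology    string
}

// syncOnce rebuilds the ControllerConfig value from the Infrastructure
// object, overwriting any manual edit made since the last sync.
func (c *clusterState) syncOnce() {
	c.ccTopology = c.infraTopology
}

func main() {
	c := &clusterState{infraTopology: "HighlyAvailable"}
	c.syncOnce()
	c.ccTopology = "SingleReplica" // a user hand-edits the spec...
	c.syncOnce()                   // ...and the next resync reverts it
	fmt.Println(c.ccTopology)      // HighlyAvailable
}
```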
not needed, as we already do that at
log from a SNO cluster created using clusterbot with a MachineConfig applied:
/test e2e-aws-single-node
Also, made some minor fixes and spell checks
For double safety, confirmed the same on a regular HA cluster. Updated
/retest
gcp-op failed due to an unrelated reason, retesting
Talked to the SNO team, the aws-single-node test is good from the MCO side. Failing sub-tests should pass with openshift/origin#25936.
yuqi-zhang
left a comment
Code is looking good! Just a few more minor questions/nits. Approving 🎉
Will try to do some manual testing as well as give others a chance to provide feedback before LGTM
	f.run(getKey(mcp, t))
}

func TestControlPlaneTopology(t *testing.T) {
Sorry, I'm a bit confused by what this is trying to test. Is it checking to see if the controllerconfig topology propagates to node annotations? Where is that being checked?
Here, we are currently checking that with a valid controlPlaneTopology, i.e. SingleReplica, setClusterConfigAnnotation() gets called and works as expected. Since in this unit test we have created a controllerConfig with SingleReplica and have set the node annotation machineconfiguration.openshift.io/controlPlaneTopology to SingleReplica, the expected result is no node change action such as a patch.
Later on, I am thinking of extending the test to also perform a machineconfiguration.openshift.io/controlPlaneTopology annotation change action. Adding this is a bit tricky.
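Reduced to its essence, the expectation reads like the hedged sketch below; needsPatch is a hypothetical stand-in for the controller behaviour under test, not the real test fixture:

```go
package main

import "testing"

const topologyAnnotation = "machineconfiguration.openshift.io/controlPlaneTopology"

// needsPatch stands in for the behaviour under test: the node controller
// should only generate a patch action when the annotation on the node
// disagrees with the topology stored in the ControllerConfig.
func needsPatch(nodeAnnotations map[string]string, ccTopology string) bool {
	return nodeAnnotations[topologyAnnotation] != ccTopology
}

// Mirrors the expectation from this thread: ControllerConfig says
// SingleReplica, the node annotation already says SingleReplica,
// so no change action (such as a patch) is expected.
func TestControlPlaneTopologyNoAction(t *testing.T) {
	node := map[string]string{topologyAnnotation: "SingleReplica"}
	if needsPatch(node, "SingleReplica") {
		t.Fatal("expected no patch action when annotation matches ControllerConfig")
	}
}
```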
Hmm, sorry for the dumb question, but I'm still not sure how we are testing that. If an action change happens as a bug, which line of code would return the error? I see we do the same for the above TestShouldDoNothing but I don't see any expectations for "nothing"
When run() in the test calls runController(), the node-controller syncHandler(pool) gets called at https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller_test.go#L114 . As you know, when this runs, various actions can be generated based on what operations have been performed on the node. In the test we filter out list and watch actions (https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller_test.go#L121) as these actions don't really update anything. For other actions like patch, any additional action will be checked later during checkAction().
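The filtering pattern being described looks roughly like this (a sketch; the real helper in node_controller_test.go matches specific resources rather than bare verbs):

```go
package node

import (
	k8stesting "k8s.io/client-go/testing"
)

// filterInformerActions drops list and watch actions, which are emitted by
// informers starting up and don't mutate anything, so the remaining actions
// (e.g. patch) can be compared one-by-one in checkAction().
func filterInformerActions(actions []k8stesting.Action) []k8stesting.Action {
	var filtered []k8stesting.Action
	for _, action := range actions {
		if action.GetVerb() == "list" || action.GetVerb() == "watch" {
			continue
		}
		filtered = append(filtered, action)
	}
	return filtered
}
```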
}

// We are here, that means we need to cordon and drain node
MCDDrainErr.WithLabelValues(dn.node.Name, "").Set(0)
Hmm, why do we set this here again? Is it to clear a previously failing drain's error if that one hit the global timeout?
Maybe, I am not sure. I kept the logic from the earlier implementation. @kikisdeliveryservice would know better.
Tried to do some manual testing but failed, will try again next week and LGTM if nobody else has any concerns
I think this looks good! Let's get this in, and iterate from there if there are any issues.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sinnykumari, yuqi-zhang. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
/retest
skipping optional test from retest
/skip
@sinnykumari: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
With the introduction of different ControlPlaneTopology types
in the OpenShift cluster, the behaviour of the cluster may differ
based on cluster type. For example, on a cluster with a single
controlPlane node it doesn't make sense to perform a workload drain.
ControllerConfig now understands the ControlPlaneTopology (a
TopologyMode) set in the cluster. The node controller later reads
the value from controllerConfig and sets it in the node annotation
machineconfiguration.openshift.io/controlPlaneTopology.
While performing a configuration update, machine-config-daemon
reads the annotation and decides the drain action based on the
controlPlaneTopology type.
MCD skips drain if controlPlaneTopology is SingleReplica.
Part of - openshift/enhancements#560
This PR also: