
Conversation

@sinnykumari
Contributor

With the introduction of different ControlPlaneTopology types
in OpenShift clusters, cluster behaviour may differ based on
cluster type. For example, on a cluster with a single
controlPlane node, it doesn't make sense to perform a workload drain.

ControllerConfig now understands the ControlPlaneTopology:TopologyMode
set in the cluster. The node controller later reads the value from
controllerConfig and sets it in the node annotation
`machineconfiguration.openshift.io/controlPlaneTopology`.

While performing a configuration update, the machine-config-daemon
reads the annotation and decides the drain action based on the
controlPlaneTopology type: MCD skips the drain if controlPlaneTopology
is SingleReplica, and defaults to performing the drain in any other case.

Part of openshift/enhancements#560

This PR also:

  • refactors drain logic
  • adds related unit tests
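
A minimal Go sketch of that drain decision, assuming only the annotation key named above; the helper and types here are illustrative, not the actual MCD code:

package main

import "fmt"

const ctrlPlaneTopologyAnnotation = "machineconfiguration.openshift.io/controlPlaneTopology"

// shouldDrain is a hypothetical helper: a node annotated as SingleReplica
// is never drained; every other topology defaults to draining.
func shouldDrain(nodeAnnotations map[string]string) bool {
    return nodeAnnotations[ctrlPlaneTopologyAnnotation] != "SingleReplica"
}

func main() {
    anns := map[string]string{ctrlPlaneTopologyAnnotation: "SingleReplica"}
    if !shouldDrain(anns) {
        fmt.Println("Drain not required, skipping")
    }
}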

@openshift-ci-robot added the approved label Mar 9, 2021
@sinnykumari
Contributor Author

/test e2e-aws-single-node

Contributor

Since you use the dn.node reference instead, the linter is complaining that the node variable isn't used anymore here

Contributor Author

yeah, fixed it

Refactored the drain logic and moved drain-related
functions into drain.go for easier maintenance.
Updated existing tests and added a new test where
the node controller reads the ControlPlaneTopology value
from controllerConfig and compares it with the
`machineconfiguration.openshift.io/controlPlaneTopology`
annotation value set on the node.
@sinnykumari
Contributor Author

unit test seems to be failing due to an unrelated test
/retest

@sinnykumari
Contributor Author

/test e2e-aws-single-node

@sinnykumari
Contributor Author

/retest

@yuqi-zhang left a comment (Contributor)

Overall looks good! Just some minor questions/nits.

Also, let's say a user has a default 3-3 cluster. If they set that controllerconfig spec, would they be able to skip drains on their updates? Or does something overwrite that?

One last question: I wonder if ensureControllerConfigSpec function needs to be updated or not

like the web console to tell users where to find the Kubernetes
API.
type: string
controlPlaneTopology:
Contributor

where is controlPlaneTopology coming from? I see in the commit message:
ControllerConfig now understands ControlPlaneTopology:TopologyMode set in the cluster.
But I don't see how we're syncing that into the controlPlaneTopology field in the controllerconfig, maybe I'm missing something

Contributor Author

syncMachineConfigController() calls resourceapply.ApplyControllerConfig, which calls resourcemerge.EnsureControllerConfig, which in turn calls ensureControllerConfigSpec(); there we check whether the Infra object has been modified and update the controllerconfig if it changed. The Infra object is where the controlPlaneTopology value lives, along with other data: https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L86 . MCO was already reading and updating the Infra content, so adding the controlPlaneTopology field to the CRD populates its value as well.
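
A condensed sketch of the merge step in that chain; the function shape is paraphrased from the code paths named above (only the Infra comparison is quoted verbatim in this PR), so treat names and signatures as approximate:

package resourcemerge

import (
    "k8s.io/apimachinery/pkg/api/equality"

    mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// ensureControllerConfigSpec (paraphrased): any drift between the desired
// Infra object (which carries controlPlaneTopology) and the existing one
// marks the controllerconfig as modified so the sync loop rewrites it.
func ensureControllerConfigSpec(modified *bool, existing *mcfgv1.ControllerConfigSpec, required mcfgv1.ControllerConfigSpec) {
    if required.Infra != nil && !equality.Semantic.DeepEqual(existing.Infra, required.Infra) {
        *modified = true
        existing.Infra = required.Infra
    }
}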

Contributor Author

And the operator reads the infrastructure object from the cluster at https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L220 and syncs the controllerConfigSpec. There are too many things to follow up...

Contributor

Ahh, ok thanks for the explanation! Right, it gets synced as part of the infra field. Not the clearest of code paths 🤦

@sinnykumari
Contributor Author

Overall looks good! Just some minor questions/nits.

Also, let's say a user has a default 3-3 cluster. If they set that controllerconfig spec, would they be able to skip drains on their updates? Or does something overwrite that?

No, the MCO operator pod keeps resyncing the controllerConfigSpec value, and we read the data from the cluster's infrastructure object, so the next sync will revert the value to whatever is set cluster-wide.

One last question: I wonder if ensureControllerConfigSpec function needs to be updated or not

not needed, as we already do that at:

if required.Infra != nil && !equality.Semantic.DeepEqual(existing.Infra, required.Infra) {

@sinnykumari
Contributor Author

Log from an SNO cluster created using clusterbot, after applying a MachineConfig:

I0310 10:38:15.014517    8164 update.go:543] Checking Reconcilable for config rendered-master-dd9d479a34df41e1ba15683557aac98f to rendered-master-032640efd4eaabc7a384aa49328dc323
I0310 10:38:15.115007    8164 update.go:1851] Starting update from rendered-master-dd9d479a34df41e1ba15683557aac98f to rendered-master-032640efd4eaabc7a384aa49328dc323: &{osUpdate:false kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:true}
I0310 10:38:15.149990    8164 update.go:1851] Node has been successfully cordoned
I0310 10:38:15.158087    8164 update.go:1851] Drain not required, skipping
I0310 10:38:15.163018    8164 update.go:1166] Updating files

@sinnykumari
Contributor Author

/test e2e-aws-single-node

Also, made some minor fixes and spell checks
@sinnykumari
Contributor Author

Overall looks good! Just some minor questions/nits.
Also, let's say a user has a default 3-3 cluster. If they set that controllerconfig spec, would they be able to skip drains on their updates? Or does something overwrite that?

No, the MCO operator pod keeps resyncing the controllerConfigSpec value, and we read the data from the cluster's infrastructure object, so the next sync will revert the value to whatever is set cluster-wide.

To be doubly safe, I confirmed the same on a regular HA cluster: I updated the controlPlaneTopology value in controllerconfig from HighlyAvailable to SingleReplica, and the value got reverted back to HighlyAvailable.

@sinnykumari
Contributor Author

/retest
/test e2e-aws-single-node

@sinnykumari
Contributor Author

gcp-op failed due to an unrelated reason, retesting
/test e2e-gcp-op

@sinnykumari
Contributor Author

Talked to the SNO team; the aws-single-node test is good from the MCO side. Failing sub-tests should pass with openshift/origin#25936.

@yuqi-zhang left a comment (Contributor)

Code is looking good! Just a few more minor questions/nits. Approving 🎉

Will try to do some manual testing as well as give others a chance to provide feedback before LGTM

f.run(getKey(mcp, t))
}

func TestControlPlaneTopology(t *testing.T) {
Contributor

Sorry, I'm a bit confused by what this is trying to test. Is it checking to see if the controllerconfig topology propagates to node annotations? Where is that being checked?

@sinnykumari (Contributor Author) Mar 12, 2021

Here, we are currently checking that with a valid controlPlaneTopology, i.e. SingleReplica, setClusterConfigAnnotation() gets called and works as expected. Since in this unit test we have created a controllerConfig with SingleReplica and have set the node annotation machineconfiguration.openshift.io/controlPlaneTopology to SingleReplica, the expected result is no node change action such as a patch.

Later on, I am thinking of extending the test to also perform a machineconfiguration.openshift.io/controlPlaneTopology annotation change action. Adding this is a bit tricky.
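
A self-contained sketch of the property this test exercises; the helper below is hypothetical and stands in for the node controller's fixture machinery:

package node

import "testing"

const ctrlPlaneTopologyAnnotation = "machineconfiguration.openshift.io/controlPlaneTopology"

// nodeNeedsTopologyPatch (hypothetical): when the annotation already matches
// the ControllerConfig topology there is nothing to reconcile, so the
// controller should emit no patch action.
func nodeNeedsTopologyPatch(ccTopology string, nodeAnnotations map[string]string) bool {
    return nodeAnnotations[ctrlPlaneTopologyAnnotation] != ccTopology
}

func TestSingleReplicaAnnotationInSync(t *testing.T) {
    anns := map[string]string{ctrlPlaneTopologyAnnotation: "SingleReplica"}
    if nodeNeedsTopologyPatch("SingleReplica", anns) {
        t.Fatal("expected no patch action when annotation matches controllerconfig")
    }
}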

Contributor

Hmm, sorry for the dumb question, but I'm still not sure how we are testing that. If an action change happens as a bug, which line of code would return the error? I see we do the same for the above TestShouldDoNothing but I don't see any expectations for "nothing"

Contributor Author

When the test's run() calls runController(), the node controller's syncHandler(pool) gets called at https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller_test.go#L114 . As you know, when this runs, various actions can be generated based on which operations were performed on the node. In the test we filter out list and watch actions (https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller_test.go#L121), as these actions don't really update anything. For other actions like patch, the additional action will be checked later during checkAction().
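
For reference, the list/watch filtering pattern described above looks roughly like this with the client-go fake clientset (a sketch; the real helper and its exact resource list live in node_controller_test.go):

package node

import clienttesting "k8s.io/client-go/testing"

// filterInformerActions (sketch): drop list/watch actions, which come from
// informer startup and never mutate anything; whatever remains (e.g. a patch
// on a node) must be matched against an expected action in checkAction(),
// otherwise the test fails.
func filterInformerActions(actions []clienttesting.Action) []clienttesting.Action {
    var ret []clienttesting.Action
    for _, action := range actions {
        if action.Matches("list", "nodes") || action.Matches("watch", "nodes") ||
            action.Matches("list", "machineconfigpools") || action.Matches("watch", "machineconfigpools") {
            continue
        }
        ret = append(ret, action)
    }
    return ret
}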

}

// We are here, that means we need to cordon and drain node
MCDDrainErr.WithLabelValues(dn.node.Name, "").Set(0)
Contributor

Hmm, why do we set this here again? Is it to clear a previously failing drain's error if that one hit the global timeout?

Contributor Author

Maybe, I am not sure. I kept the logic from the earlier implementation. @kikisdeliveryservice would know better.
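
For context, MCDDrainErr follows the usual prometheus GaugeVec pattern; a minimal sketch, with the metric name and labels assumed from the snippet above rather than taken from the MCO source:

package main

import "github.com/prometheus/client_golang/prometheus"

// MCDDrainErr (assumed shape): a gauge keyed by node name and error text.
var MCDDrainErr = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "mcd_drain_err",
        Help: "errors from failed drain",
    },
    []string{"node", "err"},
)

// Setting the gauge back to 0 with an empty error label before a fresh
// cordon/drain attempt clears any error left over from a previous failed
// drain, consistent with the guess in the question above.
func resetDrainError(nodeName string) {
    MCDDrainErr.WithLabelValues(nodeName, "").Set(0)
}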

@yuqi-zhang
Contributor

Tried to do some manual testing but failed; will try again next week and LGTM if nobody else has any concerns

@yuqi-zhang
Contributor

I think this looks good! Let's get this in, and iterate from there if there are any issues.
/lgtm

@openshift-ci-robot added the lgtm label Mar 16, 2021
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@sinnykumari
Contributor Author

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@sinnykumari
Contributor Author

skipping optional test from retest
/skip

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@sinnykumari
Contributor Author

/skip

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@sinnykumari
Contributor Author

/retest

1 similar comment

@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2021

@sinnykumari: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-workers-rhel7 1261271 link /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-single-node 1261271 link /test e2e-aws-single-node

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit eb56dc8 into openshift:master Mar 19, 2021