WIP: Check MCO node state before trigger a reboot by pliurh · Pull Request #89 · k8snetworkplumbingwg/sriov-network-operator

pliurh · 2021-03-10T07:16:59Z

No description provided.

fedepaol · 2021-03-10T16:51:55Z

+		if err := wait.PollImmediateUntil(3*time.Second, func() (bool, error) {
+			if utils.ClusterType == utils.ClusterTypeOpenshift {
+				cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey]
+				dmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey]


mcdconst.DesiredMachineConfigAnnotationKey here?

fedepaol · 2021-03-10T16:53:09Z

+			}
+
+			glog.Info("nodeStateSyncHandler(): reboot node")
+			rebootNode()


Not sure if because the pr is still WIP, so feel free to ignore the comment. Should pause the pool here and unpause it afterwards?

No, the PR doesn't want to pause the pool. The target is only preventing SNO from interrupting MCD when it's configuring the node.

umm, I thought the solution we were looking for was for a regular cluster. Even for SNO case, if you don't pause the pool nothing is stopping from MCO rendering and applying a new MC if admin initiated one.

also, note that for SNO case, MCO is going to skip drain operation, see openshift/machine-config-operator#2457

Right, like sinny said SNO should be a separate consideration

It is ok for SNO to be interrupted by MCO. It can continue its work after MCO reboot the node. SNO doesn't support a single node cluster yet.

zshi-redhat · 2021-03-12T08:59:13Z

+				cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey]
+				dmc, _ := dn.node.Annotations[mcdconst.DesiredMachineConfigAnnotationKey]
+				mcdState, _ := dn.node.Annotations[mcdconst.MachineConfigDaemonStateAnnotationKey]
+				if cmc == "" || cmc != dmc || mcdState != mcdconst.MachineConfigDaemonStateDone {


So when MCO is degraded on the node, sriov configuration will not be applied until MCO is fixed.

zshi-redhat · 2021-03-12T09:04:03Z

+			}
+
+			glog.Info("nodeStateSyncHandler(): reboot node")
+			rebootNode()


If MCO happens to apply kernel arguments while SR-IOV requesting reboot for its iommu change, will SR-IOV's change be override by MCO's change?

Will MCO become degraded when other components applying kernel argument change?

If MCO happens to apply kernel arguments while SR-IOV requesting reboot for its iommu change, will SR-IOV's change be override by MCO's change?

I thought we were pausing pools, such that the MCO would operate after SRIOV completes. In that case the MCO wouldn't touch any kargs it doesn't have listed.

Will MCO become degraded when other components applying kernel argument change?

The MCO does not validate kargs I believe, so it won't degrade.

The pausing MCO approach requires SNO to make a big amount of change. It's hard to coordinate the pause/resume operator among different nodes. There is a big difference between MCO and SRO is that the controller of SRO has no control over the order of the node reboot.

The pausing MCO approach requires SNO to make a big amount of change. It's hard to coordinate the pause/resume operator among different nodes. There is a big difference between MCO and SRO is that the controller of SRO has no control over the order of the node reboot.

Does the controller know when the nodes are done? Would the controller be able to pause the pool, let the nodes do their thing, and then unpause once the nodes finish? Or does the controller not have insight into when the nodes are done?

No, the controller doesn't know when the nodes are done. It only tells the daemon the desired state of the node. The config daemon will update the status of the node SriovNetworkNodeState CR, when it's done. But the controller doesn't check that.

zshi-redhat · 2021-03-12T09:10:38Z

-		rebootNode()
+		if err := wait.PollImmediateUntil(3*time.Second, func() (bool, error) {
+			if utils.ClusterType == utils.ClusterTypeOpenshift {
+				cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey]


Do we want to do the same for drainNode logic?

zshi-redhat · 2021-03-12T09:25:23Z

+				cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey]
+				dmc, _ := dn.node.Annotations[mcdconst.DesiredMachineConfigAnnotationKey]
+				mcdState, _ := dn.node.Annotations[mcdconst.MachineConfigDaemonStateAnnotationKey]
+				if cmc == "" || cmc != dmc || mcdState != mcdconst.MachineConfigDaemonStateDone {


@yuqi-zhang @sinnykumari could you review if we are using the MCO conditional check properly?

Is it possible, instead of using per node status, we instead use the pool status?
By doing just per node, you risk violating MCP's maxUnavailable, since a separate node could be updating due to the MCO.

The per node status checks is not fully equivalent to the MCO, if we want to match the "unavailable" definition of the MCO. Also, you probably need to re-fetch the node object here via an API call, since otherwise you're polling on an old object right?

I've changed the logic to check all the nodes of the cluster. So it will wait until MCO finishes all its work across the cluster.

There's a node informer which can guaranty the node object is up-to-date.

yuqi-zhang · 2021-03-12T20:32:12Z

Although I don't have much understanding into the SRIOV daemon, I think based on our discussion the logic could look something like:

the SRIOV daemon receives a request from its controller to perform an update
the SRIOV daemon checks the machineconfigpool the node it owns belongs to, and polls whether the pool is progressing. If it is, continue polling
if the pool is ready, pause the pool
check pool status again. if the pool is progressing but paused, unpause the pool and go back to step 2
if the pool is ready (not progressing and paused), perform SRIOV daemon operations
once rebooted, unpause the pool

This probably guarantees maximum safety for both operators. Operating on a per-node basis is very dangerous.

pliurh · 2021-03-15T08:18:15Z

Although I don't have much understanding into the SRIOV daemon, I think based on our discussion the logic could look something like:

the SRIOV daemon receives a request from its controller to perform an update

the SRIOV daemon checks the machineconfigpool the node it owns belongs to, and polls whether the pool is progressing. If it is, continue polling

if the pool is ready, pause the pool

check pool status again. if the pool is progressing but paused, unpause the pool and go back to step 2

if the pool is ready (not progressing and paused), perform SRIOV daemon operations

once rebooted, unpause the pool

This probably guarantees maximum safety for both operators. Operating on a per-node basis is very dangerous.

The code has been updated to checking the status of all the nodes. I think it will be problematic having the pause/resume pool logic on each node. If we can make sure the SNO will not interrupt the MCD, we can call it a short-term fix.

aojea · 2021-03-18T08:44:13Z

-		rebootNode()
+		if err := wait.PollImmediateUntil(3*time.Second, func() (bool, error) {
+			if utils.ClusterType == utils.ClusterTypeOpenshift {
+				for _, node := range dn.nodes {


you can use the nodes, err = dn.nodeLister.List(labels.Everything()) here to obtained the nodes, that List is obtained from the local cache from the informer, that is shared. I think that is better than sharing the list on the object.

hmm, maybe I shall check the MCP status here instead. During my test, I found during MCP update, after MCD finishes its job in one node, there is a short period of time that, all the nodes' machine config status is 'Done' before MCD starts configuring another node. So checking the node annotation here is not so reliable.

yuqi-zhang · 2021-03-19T15:44:36Z

The code has been updated to checking the status of all the nodes. I think it will be problematic having the pause/resume pool logic on each node.

I agree with this. It seems that based on the SRIOV operation, it's not very one-to-one with the MCO on how it thinks of the controller-daemon dynamic, which seems slightly problematic for our proposed fix

If we can make sure the SNO will not interrupt the MCD, we can call it a short-term fix.

I don't think this fully can. The PR today violates two things (I believe):

the maxUnavailable guarantee of pools, provided by the MCC. Because you only check for per-node status, you can kick off an update on a node while the MCO is operating on another.
there is still race conditions possible, where you check the node status, starts your update, and the MCD immediately starts its update. While SRIOV is checking MCO, MCO is not checking SRIOV, and we may still run into the same problems before. This is why pausing the pool is important as the SRIOV is signalling the MCO to not proceed

An alternative proposal is maybe just to document to users that they MUST pause the MCPs before configuring any SRIOV changes, and must unpause after? Obviously dangerous to leave it up to the user but at least we document a way that's "safe"

smarterclayton · 2021-03-19T20:05:17Z

In general, in the platform, no one except MCO should be rebooting machines except a human. It's acceptable to let the customer decide to let SRIOV restart the node, but we don't want components rebooting nodes in general purpose use. So yes, telling SRIOV consumers they should pause the MCP is acceptible because SRIOV rebooting is "customer use case specific".

pliurh · 2021-03-22T08:30:05Z

If we can make sure the SNO will not interrupt the MCD, we can call it a short-term fix.

I don't think this fully can. The PR today violates two things (I believe):

the maxUnavailable guarantee of pools, provided by the MCC. Because you only check for per-node status, you can kick off an update on a node while the MCO is operating on another.

I've updated this PR to checking the status of all the nodes in the cluster. The SNO will not reboot a node when MCO is processing on any nodes in the cluster.

there is still race conditions possible, where you check the node status, starts your update, and the MCD immediately starts its update. While SRIOV is checking MCO, MCO is not checking SRIOV, and we may still run into the same problems before. This is why pausing the pool is important as the SRIOV is signalling the MCO to not proceed

It could happen. However, I think the possibility will be very low. As the node will be rebooted immediately after the checking, the time window left for MCD is very small.
The pausing may have to face the same racing problem. Before pausing an MCP, we still need to check whether it is processing or not. And there is still a possibility that an MCP start processing right before SNO pauses the MCP.

An alternative proposal is maybe just to document to users that they MUST pause the MCPs before configuring any SRIOV changes, and must unpause after? Obviously dangerous to leave it up to the user but at least we document a way that's "safe"

pliurh · 2021-03-22T08:48:33Z

In general, in the platform, no one except MCO should be rebooting machines except a human. It's acceptable to let the customer decide to let SRIOV restart the node, but we don't want components rebooting nodes in general purpose use. So yes, telling SRIOV consumers they should pause the MCP is acceptible because SRIOV rebooting is "customer use case specific".

That is also my first thought.

However, the telco team told me:

They want to the process of configuring the cluster with minimal/no manual interaction. So they want the MachineConfig and SriovNetworkNodePolicy CRs can be applied in one shoot.
This could also happen when scaling new workers which need config both SRIOV and custom MachineConfig.

So, the target of this PR is to provide short-term mitigation before MCO providing the reboot API which can be consumed by SNO and other components.

sinnykumari · 2021-03-22T09:21:10Z

The pausing may have to face the same racing problem. Before pausing an MCP, we still need to check whether it is processing or not. And there is still a possibility that an MCP start processing right before SNO pauses the MCP.

With the current implementation, sriov daemon is first fetching state of all nodes in cluster and then iterating over all nodes and checks for states. Chances of error increases with number of nodes in the cluster.

Looks like there are some confusion going on with implementation, what we need to do is:

sriov daemon running on node came to know that a node specific config needs to be applied. like kargs
sriov daemon running on the node checks to which mcp the node belongs to. e.g. worker pool
It fetches current state of worker pool and checks pool status field (in our e.g. worker pool) machineCount == readyMachineCount . Note that we are checking state of only pool to which node belongs.
If step 3 is true, Pause the pool (in our e.g. worker pool) to which that node belongs. You would want to check here current value of pause field so that accidentally you don't unpause the pool in step 6 if cluster admin has also explicitly set it.
Now Cordon -> drain -> apply config -> reboot node -> uncordon the node
Check the pause state that you saved in step 4, if sriov daemon paused the pool, unpuase the pool otherwise do nothing and consider it as done.

pliurh · 2021-03-23T03:15:46Z

I proposed #93 which follows the pause MCP approach.

pliurh force-pushed the upstream branch from 8454be2 to 549d94e Compare March 10, 2021 07:25

fedepaol reviewed Mar 10, 2021

View reviewed changes

pliurh force-pushed the upstream branch from 549d94e to f40d5ac Compare March 12, 2021 02:24

zshi-redhat reviewed Mar 12, 2021

View reviewed changes

pliurh force-pushed the upstream branch from f40d5ac to 85e8182 Compare March 15, 2021 07:50

pliurh force-pushed the upstream branch from 85e8182 to e1520a4 Compare March 15, 2021 08:22

Check MCO node state before trigger a reboot

c35d687

pliurh force-pushed the upstream branch from e1520a4 to c35d687 Compare March 15, 2021 08:26

aojea reviewed Mar 18, 2021

View reviewed changes

pliurh closed this Mar 31, 2021

Conversation

pliurh commented Mar 10, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuqi-zhang Mar 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuqi-zhang commented Mar 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pliurh commented Mar 15, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuqi-zhang commented Mar 19, 2021

Uh oh!

smarterclayton commented Mar 19, 2021

Uh oh!

pliurh commented Mar 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pliurh commented Mar 22, 2021

Uh oh!

sinnykumari commented Mar 22, 2021

Uh oh!

pliurh commented Mar 23, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

yuqi-zhang Mar 12, 2021 •

edited

Loading

yuqi-zhang commented Mar 12, 2021 •

edited

Loading

pliurh commented Mar 22, 2021 •

edited

Loading