WIP: Check MCO node state before trigger a reboot #89
| if err := wait.PollImmediateUntil(3*time.Second, func() (bool, error) { | ||
| if utils.ClusterType == utils.ClusterTypeOpenshift { | ||
| cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey] | ||
| dmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey] |
There was a problem hiding this comment.
mcdconst.DesiredMachineConfigAnnotationKey here?
| } | ||
|
|
||
| glog.Info("nodeStateSyncHandler(): reboot node") | ||
| rebootNode() |
There was a problem hiding this comment.
Not sure if this is because the PR is still WIP, so feel free to ignore this comment. Should we pause the pool here and unpause it afterwards?
There was a problem hiding this comment.
No, this PR doesn't aim to pause the pool. The goal is only to prevent SNO from interrupting the MCD while it is configuring the node.
There was a problem hiding this comment.
Umm, I thought the solution we were looking for was for a regular cluster. Even for the SNO case, if you don't pause the pool, nothing stops the MCO from rendering and applying a new MC if an admin initiates one.
There was a problem hiding this comment.
Also, note that for the SNO case, the MCO is going to skip the drain operation, see openshift/machine-config-operator#2457
There was a problem hiding this comment.
Right, like sinny said SNO should be a separate consideration
There was a problem hiding this comment.
It is OK for SNO to be interrupted by the MCO. It can continue its work after the MCO reboots the node. SNO doesn't support a single-node cluster yet.
| cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey] | ||
| dmc, _ := dn.node.Annotations[mcdconst.DesiredMachineConfigAnnotationKey] | ||
| mcdState, _ := dn.node.Annotations[mcdconst.MachineConfigDaemonStateAnnotationKey] | ||
| if cmc == "" || cmc != dmc || mcdState != mcdconst.MachineConfigDaemonStateDone { |
There was a problem hiding this comment.
So when MCO is degraded on the node, sriov configuration will not be applied until MCO is fixed.
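The per-node condition in the diff above can be read as a single predicate over the node's annotations. The sketch below is only illustrative: the annotation keys and the Done value are assumptions written out as literals (the PR takes them from the mcdconst package), and mcdSettled is a hypothetical helper name, not something from the PR.

```go
package main

import "fmt"

// Assumed annotation keys and state value. The PR reads these from the
// mcdconst package instead of hard-coding them.
const (
	currentConfigKey = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigKey = "machineconfiguration.openshift.io/desiredConfig"
	stateKey         = "machineconfiguration.openshift.io/state"
	stateDone        = "Done"
)

// mcdSettled (hypothetical name) reports whether the annotations show that
// the node has a current config, that it matches the desired config, and
// that the MCD state is Done. It is the negation of the condition in the diff.
func mcdSettled(annotations map[string]string) bool {
	cmc := annotations[currentConfigKey]
	dmc := annotations[desiredConfigKey]
	return cmc != "" && cmc == dmc && annotations[stateKey] == stateDone
}

func main() {
	// Config names below are made up for the example.
	node := map[string]string{
		currentConfigKey: "rendered-worker-a",
		desiredConfigKey: "rendered-worker-b",
		stateKey:         "Working",
	}
	fmt.Println("safe to reboot:", mcdSettled(node))
}
```

Note that a degraded node never satisfies this predicate, which is the behavior the comment above describes: the SR-IOV reboot stays blocked until the MCO is fixed.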
| } | ||
|
|
||
| glog.Info("nodeStateSyncHandler(): reboot node") | ||
| rebootNode() |
There was a problem hiding this comment.
If the MCO happens to apply kernel arguments while SR-IOV is requesting a reboot for its iommu change, will SR-IOV's change be overridden by the MCO's change?
There was a problem hiding this comment.
Will the MCO become degraded when other components apply kernel argument changes?
There was a problem hiding this comment.
> If the MCO happens to apply kernel arguments while SR-IOV is requesting a reboot for its iommu change, will SR-IOV's change be overridden by the MCO's change?
I thought we were pausing pools, such that the MCO would operate after SRIOV completes. In that case the MCO wouldn't touch any kargs it doesn't have listed.
> Will the MCO become degraded when other components apply kernel argument changes?
I believe the MCO does not validate kargs, so it won't degrade.
There was a problem hiding this comment.
The pause-MCO approach requires a large amount of change in SNO. It's hard to coordinate the pause/resume operation among different nodes. A big difference between MCO and SRO is that the SRO controller has no control over the order of node reboots.
There was a problem hiding this comment.
> The pause-MCO approach requires a large amount of change in SNO. It's hard to coordinate the pause/resume operation among different nodes. A big difference between MCO and SRO is that the SRO controller has no control over the order of node reboots.
Does the controller know when the nodes are done? Would the controller be able to pause the pool, let the nodes do their thing, and then unpause once the nodes finish? Or does the controller not have insight into when the nodes are done?
There was a problem hiding this comment.
No, the controller doesn't know when the nodes are done. It only tells the daemon the desired state of the node. The config daemon updates the status of the node's SriovNetworkNodeState CR when it's done, but the controller doesn't check that.
| rebootNode() | ||
| if err := wait.PollImmediateUntil(3*time.Second, func() (bool, error) { | ||
| if utils.ClusterType == utils.ClusterTypeOpenshift { | ||
| cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey] |
There was a problem hiding this comment.
Do we want to do the same for drainNode logic?
| cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey] | ||
| dmc, _ := dn.node.Annotations[mcdconst.DesiredMachineConfigAnnotationKey] | ||
| mcdState, _ := dn.node.Annotations[mcdconst.MachineConfigDaemonStateAnnotationKey] | ||
| if cmc == "" || cmc != dmc || mcdState != mcdconst.MachineConfigDaemonStateDone { |
There was a problem hiding this comment.
@yuqi-zhang @sinnykumari could you review if we are using the MCO conditional check properly?
There was a problem hiding this comment.
Is it possible, instead of using per node status, we instead use the pool status?
By doing just per node, you risk violating MCP's maxUnavailable, since a separate node could be updating due to the MCO.
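To make the maxUnavailable concern concrete, here is a rough sketch of a pool-level gate. The poolStatus struct is a hypothetical, trimmed-down stand-in whose field names are modeled on the MachineConfigPool status counters, and safeToReboot is an illustrative helper, not an MCO or SR-IOV API. It assumes the caller has already resolved maxUnavailable to an integer.

```go
package main

import "fmt"

// poolStatus is a hypothetical subset of a MachineConfigPool's status
// counters, reduced to what this sketch needs.
type poolStatus struct {
	MachineCount            int
	UpdatedMachineCount     int
	UnavailableMachineCount int
}

// safeToReboot (hypothetical) reports whether one more node can go down:
// the pool must not be in the middle of a rollout, and taking one more
// node offline must not push the pool past maxUnavailable.
func safeToReboot(p poolStatus, maxUnavailable int) bool {
	rolloutDone := p.UpdatedMachineCount == p.MachineCount
	return rolloutDone && p.UnavailableMachineCount+1 <= maxUnavailable
}

func main() {
	// One of three workers is already being updated by the MCO.
	updating := poolStatus{MachineCount: 3, UpdatedMachineCount: 2, UnavailableMachineCount: 1}
	fmt.Println("reboot allowed:", safeToReboot(updating, 1))
}
```

A per-node check would allow the reboot in this example as long as the local node looks settled, even though another node in the pool is already unavailable.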
There was a problem hiding this comment.
The per-node status check is not fully equivalent to the MCO's, if we want to match the MCO's definition of "unavailable". Also, you probably need to re-fetch the node object here via an API call, since otherwise you're polling on an old object, right?
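The stale-object concern can be illustrated with a small polling loop that fetches the annotations again on every attempt instead of reusing a copy captured before the loop. This is a stdlib-only sketch: fetchFunc is a hypothetical stand-in for an API or lister call, and pollFresh is not the wait.PollImmediateUntil helper used in the diff.

```go
package main

import (
	"fmt"
	"time"
)

// fetchFunc is a hypothetical stand-in for a call that returns the node's
// current annotations (for example from the API server or an informer cache).
type fetchFunc func() (map[string]string, error)

// pollFresh (hypothetical) calls fetch up to attempts times, sleeping
// interval between calls, and returns true as soon as done accepts the
// freshly fetched annotations. Fetching inside the loop is the point:
// the condition never sees a stale copy.
func pollFresh(attempts int, interval time.Duration, fetch fetchFunc, done func(map[string]string) bool) (bool, error) {
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(interval)
		}
		annotations, err := fetch()
		if err != nil {
			return false, err
		}
		if done(annotations) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	calls := 0
	// Simulated node whose state flips to Done on the third fetch.
	fetch := func() (map[string]string, error) {
		calls++
		if calls < 3 {
			return map[string]string{"state": "Working"}, nil
		}
		return map[string]string{"state": "Done"}, nil
	}
	ok, err := pollFresh(5, 10*time.Millisecond, fetch, func(a map[string]string) bool {
		return a["state"] == "Done"
	})
	fmt.Println(ok, err, calls)
}
```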
There was a problem hiding this comment.
I've changed the logic to check all the nodes in the cluster, so it will wait until the MCO finishes all its work across the cluster.
There's a node informer which guarantees that the node object is up to date.
|
Although I don't have much insight into the SRIOV daemon, I think based on our discussion the logic could look something like:
This probably guarantees maximum safety for both operators. Operating on a per-node basis is very dangerous. |
The code has been updated to check the status of all the nodes. I think it will be problematic to have the pause/resume pool logic on each node. If we can make sure SNO will not interrupt the MCD, we can call it a short-term fix. |
| rebootNode() | ||
| if err := wait.PollImmediateUntil(3*time.Second, func() (bool, error) { | ||
| if utils.ClusterType == utils.ClusterTypeOpenshift { | ||
| for _, node := range dn.nodes { |
There was a problem hiding this comment.
You can use nodes, err = dn.nodeLister.List(labels.Everything()) here to obtain the nodes. That list comes from the informer's local cache, which is shared. I think that is better than storing the list on the object.
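As a rough illustration of this suggestion, the sketch below lists the nodes through a function on every check rather than keeping a slice on the daemon object. It uses only the standard library: listNodes stands in for dn.nodeLister.List(labels.Everything()), nodeInfo is a hypothetical trimmed-down Node, and the annotation keys are assumptions written out as literals.

```go
package main

import "fmt"

// Assumed annotation keys and state value (the PR uses mcdconst constants).
const (
	currentConfigKey = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigKey = "machineconfiguration.openshift.io/desiredConfig"
	stateKey         = "machineconfiguration.openshift.io/state"
	stateDone        = "Done"
)

// nodeInfo is a hypothetical, trimmed-down stand-in for a Kubernetes Node.
type nodeInfo struct {
	Name        string
	Annotations map[string]string
}

// listNodes stands in for the lister call backed by the informer's cache.
type listNodes func() ([]nodeInfo, error)

// busyNodes (hypothetical) lists the nodes and returns the names of those
// whose MCD has not settled, using the same condition as the diff.
func busyNodes(list listNodes) ([]string, error) {
	nodes, err := list()
	if err != nil {
		return nil, fmt.Errorf("listing nodes: %w", err)
	}
	var busy []string
	for _, n := range nodes {
		a := n.Annotations
		cmc := a[currentConfigKey]
		if cmc == "" || cmc != a[desiredConfigKey] || a[stateKey] != stateDone {
			busy = append(busy, n.Name)
		}
	}
	return busy, nil
}

func main() {
	// Node and config names below are made up for the example.
	list := func() ([]nodeInfo, error) {
		return []nodeInfo{
			{Name: "worker-0", Annotations: map[string]string{currentConfigKey: "a", desiredConfigKey: "a", stateKey: stateDone}},
			{Name: "worker-1", Annotations: map[string]string{currentConfigKey: "a", desiredConfigKey: "b", stateKey: "Working"}},
		}, nil
	}
	busy, err := busyNodes(list)
	fmt.Println(busy, err)
}
```

Because the list is fetched inside the check, each poll iteration sees whatever the cache currently holds.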
There was a problem hiding this comment.
Hmm, maybe I should check the MCP status here instead. During my tests, I found that during an MCP update, after the MCD finishes its job on one node, there is a short period in which the machine config state of all the nodes is 'Done' before the MCD starts configuring another node. So checking the node annotations here is not very reliable.
I agree with this. It seems that based on the SRIOV operation, it's not very one-to-one with the MCO on how it thinks of the controller-daemon dynamic, which seems slightly problematic for our proposed fix
I don't think this fully can. The PR today violates two things (I believe):
An alternative proposal is maybe just to document to users that they MUST pause the MCPs before configuring any SRIOV changes, and must unpause after? Obviously dangerous to leave it up to the user but at least we document a way that's "safe" |
|
In general, in the platform, no one except the MCO or a human should be rebooting machines. It's acceptable to let the customer decide to let SRIOV restart the node, but we don't want components rebooting nodes in general-purpose use. So yes, telling SRIOV consumers they should pause the MCP is acceptable because SRIOV rebooting is "customer use case specific". |
I've updated this PR to check the status of all the nodes in the cluster. SNO will not reboot a node while the MCO is working on any node in the cluster.
It could happen. However, I think the probability is very low. As the node is rebooted immediately after the check, the time window left for the MCD is very small.
|
That is also my first thought. However, the telco team told me:
So, the goal of this PR is to provide a short-term mitigation until the MCO provides a reboot API that SNO and other components can consume. |
With the current implementation, the sriov daemon first fetches the state of all nodes in the cluster and then iterates over them, checking each node's state. The chance of error increases with the number of nodes in the cluster. It looks like there is some confusion about the implementation. What we need to do is:
|
|
I proposed #93 which follows the pause MCP approach. |