Skip to content

WIP: Check MCO node state before trigger a reboot#89

Closed
pliurh wants to merge 1 commit into
k8snetworkplumbingwg:masterfrom
pliurh:upstream
Closed

WIP: Check MCO node state before trigger a reboot#89
pliurh wants to merge 1 commit into
k8snetworkplumbingwg:masterfrom
pliurh:upstream

Conversation

@pliurh
Copy link
Copy Markdown
Collaborator

@pliurh pliurh commented Mar 10, 2021

No description provided.

Comment thread pkg/daemon/daemon.go Outdated
if err := wait.PollImmediateUntil(3*time.Second, func() (bool, error) {
if utils.ClusterType == utils.ClusterTypeOpenshift {
cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey]
dmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mcdconst.DesiredMachineConfigAnnotationKey here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment thread pkg/daemon/daemon.go
}

glog.Info("nodeStateSyncHandler(): reboot node")
rebootNode()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if because the pr is still WIP, so feel free to ignore the comment. Should pause the pool here and unpause it afterwards?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the PR doesn't want to pause the pool. The target is only preventing SNO from interrupting MCD when it's configuring the node.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm, I thought the solution we were looking for was for a regular cluster. Even for SNO case, if you don't pause the pool nothing is stopping from MCO rendering and applying a new MC if admin initiated one.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, note that for SNO case, MCO is going to skip drain operation, see openshift/machine-config-operator#2457

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, like sinny said SNO should be a separate consideration

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is ok for SNO to be interrupted by MCO. It can continue its work after MCO reboot the node. SNO doesn't support a single node cluster yet.

Comment thread pkg/daemon/daemon.go Outdated
cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey]
dmc, _ := dn.node.Annotations[mcdconst.DesiredMachineConfigAnnotationKey]
mcdState, _ := dn.node.Annotations[mcdconst.MachineConfigDaemonStateAnnotationKey]
if cmc == "" || cmc != dmc || mcdState != mcdconst.MachineConfigDaemonStateDone {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So when MCO is degraded on the node, sriov configuration will not be applied until MCO is fixed.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

Comment thread pkg/daemon/daemon.go
}

glog.Info("nodeStateSyncHandler(): reboot node")
rebootNode()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If MCO happens to apply kernel arguments while SR-IOV requesting reboot for its iommu change, will SR-IOV's change be override by MCO's change?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will MCO become degraded when other components applying kernel argument change?

Copy link
Copy Markdown

@yuqi-zhang yuqi-zhang Mar 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If MCO happens to apply kernel arguments while SR-IOV requesting reboot for its iommu change, will SR-IOV's change be override by MCO's change?

I thought we were pausing pools, such that the MCO would operate after SRIOV completes. In that case the MCO wouldn't touch any kargs it doesn't have listed.

Will MCO become degraded when other components applying kernel argument change?

The MCO does not validate kargs I believe, so it won't degrade.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pausing MCO approach requires SNO to make a big amount of change. It's hard to coordinate the pause/resume operator among different nodes. There is a big difference between MCO and SRO is that the controller of SRO has no control over the order of the node reboot.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pausing MCO approach requires SNO to make a big amount of change. It's hard to coordinate the pause/resume operator among different nodes. There is a big difference between MCO and SRO is that the controller of SRO has no control over the order of the node reboot.

Does the controller know when the nodes are done? Would the controller be able to pause the pool, let the nodes do their thing, and then unpause once the nodes finish? Or does the controller not have insight into when the nodes are done?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the controller doesn't know when the nodes are done. It only tells the daemon the desired state of the node. The config daemon will update the status of the node SriovNetworkNodeState CR, when it's done. But the controller doesn't check that.

Comment thread pkg/daemon/daemon.go Outdated
rebootNode()
if err := wait.PollImmediateUntil(3*time.Second, func() (bool, error) {
if utils.ClusterType == utils.ClusterTypeOpenshift {
cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do the same for drainNode logic?

Comment thread pkg/daemon/daemon.go Outdated
cmc, _ := dn.node.Annotations[mcdconst.CurrentMachineConfigAnnotationKey]
dmc, _ := dn.node.Annotations[mcdconst.DesiredMachineConfigAnnotationKey]
mcdState, _ := dn.node.Annotations[mcdconst.MachineConfigDaemonStateAnnotationKey]
if cmc == "" || cmc != dmc || mcdState != mcdconst.MachineConfigDaemonStateDone {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuqi-zhang @sinnykumari could you review if we are using the MCO conditional check properly?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible, instead of using per node status, we instead use the pool status?
By doing just per node, you risk violating MCP's maxUnavailable, since a separate node could be updating due to the MCO.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The per node status checks is not fully equivalent to the MCO, if we want to match the "unavailable" definition of the MCO. Also, you probably need to re-fetch the node object here via an API call, since otherwise you're polling on an old object right?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the logic to check all the nodes of the cluster. So it will wait until MCO finishes all its work across the cluster.

There's a node informer which can guaranty the node object is up-to-date.

@yuqi-zhang
Copy link
Copy Markdown

yuqi-zhang commented Mar 12, 2021

Although I don't have much understanding into the SRIOV daemon, I think based on our discussion the logic could look something like:

  1. the SRIOV daemon receives a request from its controller to perform an update
  2. the SRIOV daemon checks the machineconfigpool the node it owns belongs to, and polls whether the pool is progressing. If it is, continue polling
  3. if the pool is ready, pause the pool
  4. check pool status again. if the pool is progressing but paused, unpause the pool and go back to step 2
  5. if the pool is ready (not progressing and paused), perform SRIOV daemon operations
  6. once rebooted, unpause the pool

This probably guarantees maximum safety for both operators. Operating on a per-node basis is very dangerous.

@pliurh
Copy link
Copy Markdown
Collaborator Author

pliurh commented Mar 15, 2021

Although I don't have much understanding into the SRIOV daemon, I think based on our discussion the logic could look something like:

  1. the SRIOV daemon receives a request from its controller to perform an update
  2. the SRIOV daemon checks the machineconfigpool the node it owns belongs to, and polls whether the pool is progressing. If it is, continue polling
  3. if the pool is ready, pause the pool
  4. check pool status again. if the pool is progressing but paused, unpause the pool and go back to step 2
  5. if the pool is ready (not progressing and paused), perform SRIOV daemon operations
  6. once rebooted, unpause the pool

This probably guarantees maximum safety for both operators. Operating on a per-node basis is very dangerous.

The code has been updated to checking the status of all the nodes. I think it will be problematic having the pause/resume pool logic on each node. If we can make sure the SNO will not interrupt the MCD, we can call it a short-term fix.

Comment thread pkg/daemon/daemon.go
rebootNode()
if err := wait.PollImmediateUntil(3*time.Second, func() (bool, error) {
if utils.ClusterType == utils.ClusterTypeOpenshift {
for _, node := range dn.nodes {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use the nodes, err = dn.nodeLister.List(labels.Everything()) here to obtained the nodes, that List is obtained from the local cache from the informer, that is shared. I think that is better than sharing the list on the object.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, maybe I shall check the MCP status here instead. During my test, I found during MCP update, after MCD finishes its job in one node, there is a short period of time that, all the nodes' machine config status is 'Done' before MCD starts configuring another node. So checking the node annotation here is not so reliable.

@yuqi-zhang
Copy link
Copy Markdown

The code has been updated to checking the status of all the nodes. I think it will be problematic having the pause/resume pool logic on each node.

I agree with this. It seems that based on the SRIOV operation, it's not very one-to-one with the MCO on how it thinks of the controller-daemon dynamic, which seems slightly problematic for our proposed fix

If we can make sure the SNO will not interrupt the MCD, we can call it a short-term fix.

I don't think this fully can. The PR today violates two things (I believe):

  1. the maxUnavailable guarantee of pools, provided by the MCC. Because you only check for per-node status, you can kick off an update on a node while the MCO is operating on another.
  2. there is still race conditions possible, where you check the node status, starts your update, and the MCD immediately starts its update. While SRIOV is checking MCO, MCO is not checking SRIOV, and we may still run into the same problems before. This is why pausing the pool is important as the SRIOV is signalling the MCO to not proceed

An alternative proposal is maybe just to document to users that they MUST pause the MCPs before configuring any SRIOV changes, and must unpause after? Obviously dangerous to leave it up to the user but at least we document a way that's "safe"

@smarterclayton
Copy link
Copy Markdown

In general, in the platform, no one except MCO should be rebooting machines except a human. It's acceptable to let the customer decide to let SRIOV restart the node, but we don't want components rebooting nodes in general purpose use. So yes, telling SRIOV consumers they should pause the MCP is acceptible because SRIOV rebooting is "customer use case specific".

@pliurh
Copy link
Copy Markdown
Collaborator Author

pliurh commented Mar 22, 2021

If we can make sure the SNO will not interrupt the MCD, we can call it a short-term fix.

I don't think this fully can. The PR today violates two things (I believe):

  1. the maxUnavailable guarantee of pools, provided by the MCC. Because you only check for per-node status, you can kick off an update on a node while the MCO is operating on another.

I've updated this PR to checking the status of all the nodes in the cluster. The SNO will not reboot a node when MCO is processing on any nodes in the cluster.

  1. there is still race conditions possible, where you check the node status, starts your update, and the MCD immediately starts its update. While SRIOV is checking MCO, MCO is not checking SRIOV, and we may still run into the same problems before. This is why pausing the pool is important as the SRIOV is signalling the MCO to not proceed

It could happen. However, I think the possibility will be very low. As the node will be rebooted immediately after the checking, the time window left for MCD is very small.
The pausing may have to face the same racing problem. Before pausing an MCP, we still need to check whether it is processing or not. And there is still a possibility that an MCP start processing right before SNO pauses the MCP.

An alternative proposal is maybe just to document to users that they MUST pause the MCPs before configuring any SRIOV changes, and must unpause after? Obviously dangerous to leave it up to the user but at least we document a way that's "safe"

@pliurh
Copy link
Copy Markdown
Collaborator Author

pliurh commented Mar 22, 2021

In general, in the platform, no one except MCO should be rebooting machines except a human. It's acceptable to let the customer decide to let SRIOV restart the node, but we don't want components rebooting nodes in general purpose use. So yes, telling SRIOV consumers they should pause the MCP is acceptible because SRIOV rebooting is "customer use case specific".

That is also my first thought.

However, the telco team told me:

  1. They want to the process of configuring the cluster with minimal/no manual interaction. So they want the MachineConfig and SriovNetworkNodePolicy CRs can be applied in one shoot.
  2. This could also happen when scaling new workers which need config both SRIOV and custom MachineConfig.

So, the target of this PR is to provide short-term mitigation before MCO providing the reboot API which can be consumed by SNO and other components.

@sinnykumari
Copy link
Copy Markdown

The pausing may have to face the same racing problem. Before pausing an MCP, we still need to check whether it is processing or not. And there is still a possibility that an MCP start processing right before SNO pauses the MCP.

With the current implementation, sriov daemon is first fetching state of all nodes in cluster and then iterating over all nodes and checks for states. Chances of error increases with number of nodes in the cluster.

Looks like there are some confusion going on with implementation, what we need to do is:

  1. sriov daemon running on node came to know that a node specific config needs to be applied. like kargs
  2. sriov daemon running on the node checks to which mcp the node belongs to. e.g. worker pool
  3. It fetches current state of worker pool and checks pool status field (in our e.g. worker pool) machineCount == readyMachineCount . Note that we are checking state of only pool to which node belongs.
  4. If step 3 is true, Pause the pool (in our e.g. worker pool) to which that node belongs. You would want to check here current value of pause field so that accidentally you don't unpause the pool in step 6 if cluster admin has also explicitly set it.
  5. Now Cordon -> drain -> apply config -> reboot node -> uncordon the node
  6. Check the pause state that you saved in step 4, if sriov daemon paused the pool, unpuase the pool otherwise do nothing and consider it as done.

@pliurh
Copy link
Copy Markdown
Collaborator Author

pliurh commented Mar 23, 2021

I proposed #93 which follows the pause MCP approach.

@pliurh pliurh closed this Mar 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants