Skip to content

Graceful deletion of a VirtualMachineInstance#145

Merged
k8s-ci-robot merged 6 commits into
kubernetes-sigs:mainfrom
nunnatsa:drain-node
Aug 3, 2022
Merged

Graceful deletion of a VirtualMachineInstance#145
k8s-ci-robot merged 6 commits into
kubernetes-sigs:mainfrom
nunnatsa:drain-node

Conversation

@nunnatsa
Copy link
Copy Markdown
Contributor

@nunnatsa nunnatsa commented Jun 30, 2022

When a host node is deleted, the guest VMIs are evicted, killing the
running tenant node in this VMI.

This PR gracefully delete the VMI, by first drain the guest node. Only
if the drain successful, CAPK will delete the VMI.

The PR is based on a feature in KubeVirt, where by setting the eviction
strategy to "external" KubeVirt will not delete the VMI, but will set
the vmi.Status.EvacuationNodeName field to signal to the external
controller (CAPK in this case) that the VMI should be evicted.

Known Issues

Signed-off-by: Nahshon Unna-Tsameret nunnatsa@redhat.com

Release notes:

Added graceful eviction of a VMI

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 30, 2022
@k8s-ci-robot k8s-ci-robot requested review from justinsb and rmohr June 30, 2022 07:06
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 30, 2022
@nunnatsa nunnatsa force-pushed the drain-node branch 2 times, most recently from 0658eb0 to 14fabcf Compare June 30, 2022 07:47
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 30, 2022
@coveralls
Copy link
Copy Markdown

coveralls commented Jun 30, 2022

Pull Request Test Coverage Report for Build 2785024320

  • 119 of 174 (68.39%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.8%) to 51.311%

Changes Missing Coverage Covered Lines Changed/Added Lines %
controllers/vmi_eviction_controller.go 119 174 68.39%
Totals Coverage Status
Change from base Build 2776254636: 1.8%
Covered Lines: 939
Relevant Lines: 1830

💛 - Coveralls

Copy link
Copy Markdown
Contributor

@davidvossel davidvossel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great start!

I left a few comments in line and have one kind of higher level comment.

We need a drain timeout, where if a drain doesn't complete after x minutes (regardless of how many times drain is retried), we delete the VMI regardless. This timeout is meant to ensure that we won't block the removal of a VMI indefinitely when a infra node is being drained.

Comment thread controllers/vmi_controller.go Outdated
// KubeVirt will set the EvacuationNodeName field in case of guest node eviction. If the field is not set, there is
// nothing to do.
nodeName := vmi.Status.EvacuationNodeName
if len(nodeName) == 0 { // no need to drain
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the nodeName != "" check here, we should also have EvictionStrategy == External.

This is kind of a forward looking check that ensures in the future we only attempt to drain nodes that aren't live migratable.

Comment thread controllers/vmi_controller.go Outdated
Comment on lines +73 to +75
// now, when the node is drained, we can safely delete the VMI
propagationPolicy := metav1.DeletePropagationForeground
err = r.Delete(ctx, vmi, &client.DeleteOptions{PropagationPolicy: &propagationPolicy})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to wrap this section in a if vmi.DeletionTimestamp == nil and only delete if the VMI is not already marked for deletion.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, can be validated even before calling to drainNode

Comment thread controllers/vmi_controller.go Outdated
Comment on lines +77 to +79
if !apierrors.IsAlreadyExists(err) {
return ctrl.Result{RequeueAfter: 20 * time.Second}, err
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we check for the DeletionTimestamp, we shouldn't have to qualify what error occurred here. We also shouldn't need the RequeueAfter.

Comment thread controllers/vmi_controller.go Outdated
@@ -0,0 +1,212 @@
package controllers
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think need to give more meaningful name to this file, something like 'vmievacuation_controller.go

Comment thread controllers/vmi_controller.go Outdated
return ctrl.Result{}, err
}

nodeDrained, retryDuration, err := r.drainNode(ctx, cluster, nodeName, logger)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there should be cases returning nodeDrained=false instead of just returning the error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there should be cases returning nodeDrained=false instead of just returning the error?

Because for most cases (but not all of them), we don't want to retry. returning error with the result will cause retry. The calling function can't know which error to return with the result and which error to ignore.

Comment thread controllers/vmi_controller.go Outdated
Comment on lines +73 to +75
// now, when the node is drained, we can safely delete the VMI
propagationPolicy := metav1.DeletePropagationForeground
err = r.Delete(ctx, vmi, &client.DeleteOptions{PropagationPolicy: &propagationPolicy})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, can be validated even before calling to drainNode

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 4, 2022
Copy link
Copy Markdown
Contributor

@davidvossel davidvossel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great stuff, i left one comment in line.

Have you thought about how to e2e test this?

Comment on lines +193 to +216
if err = kubedrain.RunNodeDrain(drainer, node.Name); err != nil {
// Machine will be re-reconciled after a drain failure.
logger.Error(err, "Drain failed, retry in 20s", "node name", nodeName)
return false, 20 * time.Second, nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a global timeout for drain as well. Something that ensures we eventually give up attempting to drain and delete the VMI after x minutes. 10 minutes would likely be a pretty conservative number to use.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an annotation on the VMI could be used to track when the drain begins. The timeout could be anchored to the start time recorded in the annotatino.

@nunnatsa
Copy link
Copy Markdown
Contributor Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jul 12, 2022
@nunnatsa nunnatsa force-pushed the drain-node branch 10 times, most recently from cf1e68c to 61f8a47 Compare July 13, 2022 13:56
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2022
@nunnatsa nunnatsa force-pushed the drain-node branch 8 times, most recently from 24ff909 to a1ec316 Compare July 31, 2022 10:54
@nunnatsa nunnatsa changed the title WIP: Graceful deletion of a VirtualMachineInstance Graceful deletion of a VirtualMachineInstance Jul 31, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 31, 2022
Comment thread e2e/create-cluster_test.go Outdated
})
}).WithTimeout(time.Minute*5).
WithPolling(time.Second*10).
Should(Succeed(), "pod eviction failed")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's okay for eviction to fail, this is what we check for in the kubevirt e2e test that exercises similar behavior

			err = virtClient.CoreV1().Pods(vmi.Namespace).EvictV1beta1(context.Background(), &policyv1beta1.Eviction{ObjectMeta: metav1.ObjectMeta{Name: pod.Name}})
			// The "too many requests" err is what get's returned when an
			// eviction would invalidate a pdb. This is what we want to see here.
			Expect(errors.IsTooManyRequests(err)).To(BeTrue())

https://github.com/kubevirt/kubevirt/blob/835683a546d123774c99787c55aa6cdf28d5fabd/tests/vmi_lifecycle_test.go#L227

Comment thread e2e/create-cluster_test.go Outdated
Comment on lines +817 to +818
vmiName := chosenVMI.Name
vmiUID := chosenVMI.GetUID()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it confusing to declare a value for this vmiUID and then have it immediately overwritten 3 lines down.

Comment thread e2e/create-cluster_test.go Outdated
DeleteOptions: &metav1.DeleteOptions{
GracePeriodSeconds: pointer.Int64(60 * 10), // 10 minutes
},
})).ShouldNot(Succeed())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be accurate, we're looking for a specific error here, Expect(errors.IsTooManyRequests(err)).To(BeTrue()).

Comment thread e2e/create-cluster_test.go
Comment on lines +930 to +914
vmi, err = virtClient.VirtualMachineInstance(namespace).Get(vmiName, &metav1.GetOptions{})
g.Expect(err).ShouldNot(HaveOccurred())

vmiDebugPrintout(vmi)

g.Expect(vmi.Status.EvacuationNodeName).ShouldNot(BeEmpty())
g.Expect(vmi.DeletionTimestamp).ShouldNot(BeNil())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have a controller that is deleting a VMI when evacuation node name is set, I think you should put a finalizer on the vmi in order to reliably observe EvacuationNodeName. otherwise it's possible the VMI could disappear before we see it in the functional test.

Comment thread e2e/create-cluster_test.go Outdated
Comment on lines +838 to +867
By("Read the worker node from the tenant cluster, and validate its IP")
Eventually(func(g Gomega) bool {
// reading the node and the VMI again and again, because it takes time to the IPs to be synchronized
node, err := clientSet.CoreV1().Nodes().Get(context.Background(), vmiName, metav1.GetOptions{})
g.Expect(err).ToNot(HaveOccurred())

var nodeIp string
for _, address := range node.Status.Addresses {
if address.Type == "InternalIP" {
nodeIp = address.Address
}
}

g.Expect(nodeIp).ShouldNot(BeEmpty(), "node's IP is not set")

vmi, err := virtClient.VirtualMachineInstance(namespace).Get(chosenVMI.Name, &metav1.GetOptions{})

g.Expect(err).ShouldNot(HaveOccurred())
g.Expect(vmi).ShouldNot(BeNil())

for _, ifs := range vmi.Status.Interfaces {
for _, ip := range ifs.IPs {
if ip == nodeIp {
return true
}
}
}
return false

}).WithTimeout(5 * time.Minute).WithPolling(10 * time.Second).Should(BeTrue())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would make sense to move this to a helper function

Comment thread controllers/vmi_eviction_controller.go Outdated
Comment on lines +86 to +100
nodeDrained, retryDuration, err := r.drainNode(ctx, cluster, vmi.Status.EvacuationNodeName, logger)
if err != nil {
return ctrl.Result{RequeueAfter: retryDuration}, err
}

if !nodeDrained && r.waitingForTimeout(ctx, vmi, logger) {
return ctrl.Result{RequeueAfter: retryDuration}, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The drain can block indefinitely due to the way this logic is structured. The timeout needs to be checked either regardless if the drainNode() function returns an error or before drainNode is even called. If the global timeout occurs, the VMI should be deleted.

We don't want anything within the guest cluster to indefinitely block an infra cluster node from being drained.

When a host node is deleted, the guest VMIs are evicted, killing the
running tenant node in this VMI.

This PR gracefully delete the VMI, by first drain the guest node. Only
if the drain successful, CAPK will delete the VMI.

The PR is based on a feature in KubeVirt, where by setting the eviction
strategy to "external" KubeVirt will not delete the VMI, but will set
the `vmi.Status.EvacuationNodeName` field to signal to the external
controller (CAPK in this case) that the VMI should be evicted.

Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
Also, add unit tests. Needed to changed the logic to get the guest
cluster client, to use the workloadClient type, because the previous
code was not testable.

Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
If drain is failing for more than 10 minutes, delete the VMI anyway.

Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
@nunnatsa nunnatsa force-pushed the drain-node branch 4 times, most recently from e7dfa69 to b1e5a0b Compare August 2, 2022 19:29
Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
Copy link
Copy Markdown
Contributor

@davidvossel davidvossel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

excellent work 👍

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 3, 2022
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel, nunnatsa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [davidvossel,nunnatsa]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants