Graceful deletion of a VirtualMachineInstance by nunnatsa · Pull Request #145 · kubernetes-sigs/cluster-api-provider-kubevirt

nunnatsa · 2022-06-30T07:06:12Z

When a host node is deleted, the guest VMIs are evicted, killing the
running tenant node in this VMI.

This PR gracefully delete the VMI, by first drain the guest node. Only
if the drain successful, CAPK will delete the VMI.

The PR is based on a feature in KubeVirt, where by setting the eviction
strategy to "external" KubeVirt will not delete the VMI, but will set
the vmi.Status.EvacuationNodeName field to signal to the external
controller (CAPK in this case) that the VMI should be evicted.

Known Issues

does not support external clusters (CAPK depends on KubeVirt running on the management cluster #100)

Signed-off-by: Nahshon Unna-Tsameret nunnatsa@redhat.com

Release notes:

Added graceful eviction of a VMI

coveralls · 2022-06-30T12:09:19Z

Pull Request Test Coverage Report for Build 2785024320

119 of 174 (68.39%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+1.8%) to 51.311%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
controllers/vmi_eviction_controller.go	119	174	68.39%

Totals
Change from base Build 2776254636:	1.8%
Covered Lines:	939
Relevant Lines:	1830

💛 - Coveralls

davidvossel

great start!

I left a few comments in line and have one kind of higher level comment.

We need a drain timeout, where if a drain doesn't complete after x minutes (regardless of how many times drain is retried), we delete the VMI regardless. This timeout is meant to ensure that we won't block the removal of a VMI indefinitely when a infra node is being drained.

davidvossel · 2022-06-30T15:52:03Z

+	// KubeVirt will set the EvacuationNodeName field in case of guest node eviction. If the field is not set, there is
+	// nothing to do.
+	nodeName := vmi.Status.EvacuationNodeName
+	if len(nodeName) == 0 { // no need to drain


In addition to the nodeName != "" check here, we should also have EvictionStrategy == External.

This is kind of a forward looking check that ensures in the future we only attempt to drain nodes that aren't live migratable.

davidvossel · 2022-06-30T15:56:05Z

+	// now, when the node is drained, we can safely delete the VMI
+	propagationPolicy := metav1.DeletePropagationForeground
+	err = r.Delete(ctx, vmi, &client.DeleteOptions{PropagationPolicy: &propagationPolicy})


I think it would make sense to wrap this section in a if vmi.DeletionTimestamp == nil and only delete if the VMI is not already marked for deletion.

Agree, can be validated even before calling to drainNode

davidvossel · 2022-06-30T15:56:57Z

+		if !apierrors.IsAlreadyExists(err) {
+			return ctrl.Result{RequeueAfter: 20 * time.Second}, err
+		}


if we check for the DeletionTimestamp, we shouldn't have to qualify what error occurred here. We also shouldn't need the RequeueAfter.

nirarg · 2022-07-03T06:38:48Z

@@ -0,0 +1,212 @@
+package controllers


I think need to give more meaningful name to this file, something like 'vmievacuation_controller.go

nirarg · 2022-07-03T06:51:56Z

+		return ctrl.Result{}, err
+	}
+
+	nodeDrained, retryDuration, err := r.drainNode(ctx, cluster, nodeName, logger)


Why there should be cases returning nodeDrained=false instead of just returning the error?

Why there should be cases returning nodeDrained=false instead of just returning the error?

Because for most cases (but not all of them), we don't want to retry. returning error with the result will cause retry. The calling function can't know which error to return with the result and which error to ignore.

nirarg · 2022-07-03T06:54:20Z

+	// now, when the node is drained, we can safely delete the VMI
+	propagationPolicy := metav1.DeletePropagationForeground
+	err = r.Delete(ctx, vmi, &client.DeleteOptions{PropagationPolicy: &propagationPolicy})


Agree, can be validated even before calling to drainNode

davidvossel

great stuff, i left one comment in line.

Have you thought about how to e2e test this?

davidvossel · 2022-07-06T20:44:19Z

+	if err = kubedrain.RunNodeDrain(drainer, node.Name); err != nil {
+		// Machine will be re-reconciled after a drain failure.
+		logger.Error(err, "Drain failed, retry in 20s", "node name", nodeName)
+		return false, 20 * time.Second, nil


we need a global timeout for drain as well. Something that ensures we eventually give up attempting to drain and delete the VMI after x minutes. 10 minutes would likely be a pretty conservative number to use.

an annotation on the VMI could be used to track when the drain begins. The timeout could be anchored to the start time recorded in the annotatino.

nunnatsa · 2022-07-12T12:21:28Z

/ok-to-test

davidvossel · 2022-07-25T14:41:34Z

+			})
+		}).WithTimeout(time.Minute*5).
+			WithPolling(time.Second*10).
+			Should(Succeed(), "pod eviction failed")


it's okay for eviction to fail, this is what we check for in the kubevirt e2e test that exercises similar behavior

err = virtClient.CoreV1().Pods(vmi.Namespace).EvictV1beta1(context.Background(), &policyv1beta1.Eviction{ObjectMeta: metav1.ObjectMeta{Name: pod.Name}}) // The "too many requests" err is what get's returned when an // eviction would invalidate a pdb. This is what we want to see here. Expect(errors.IsTooManyRequests(err)).To(BeTrue())

https://github.com/kubevirt/kubevirt/blob/835683a546d123774c99787c55aa6cdf28d5fabd/tests/vmi_lifecycle_test.go#L227

davidvossel · 2022-08-01T13:43:34Z

+		vmiName := chosenVMI.Name
+		vmiUID := chosenVMI.GetUID()


I found it confusing to declare a value for this vmiUID and then have it immediately overwritten 3 lines down.

davidvossel · 2022-08-01T13:45:57Z

+		DeleteOptions: &metav1.DeleteOptions{
+			GracePeriodSeconds: pointer.Int64(60 * 10), // 10 minutes
+		},
+	})).ShouldNot(Succeed())


to be accurate, we're looking for a specific error here, Expect(errors.IsTooManyRequests(err)).To(BeTrue()).

davidvossel · 2022-08-01T13:52:09Z

+		vmi, err = virtClient.VirtualMachineInstance(namespace).Get(vmiName, &metav1.GetOptions{})
+		g.Expect(err).ShouldNot(HaveOccurred())
+
+		vmiDebugPrintout(vmi)
+
+		g.Expect(vmi.Status.EvacuationNodeName).ShouldNot(BeEmpty())
+		g.Expect(vmi.DeletionTimestamp).ShouldNot(BeNil())


Since we have a controller that is deleting a VMI when evacuation node name is set, I think you should put a finalizer on the vmi in order to reliably observe EvacuationNodeName. otherwise it's possible the VMI could disappear before we see it in the functional test.

davidvossel · 2022-08-01T13:53:00Z

+		By("Read the worker node from the tenant cluster, and validate its IP")
+		Eventually(func(g Gomega) bool {
+			// reading the node and the VMI again and again, because it takes time to the IPs to be synchronized
+			node, err := clientSet.CoreV1().Nodes().Get(context.Background(), vmiName, metav1.GetOptions{})
+			g.Expect(err).ToNot(HaveOccurred())
+
+			var nodeIp string
+			for _, address := range node.Status.Addresses {
+				if address.Type == "InternalIP" {
+					nodeIp = address.Address
+				}
+			}
+
+			g.Expect(nodeIp).ShouldNot(BeEmpty(), "node's IP is not set")
+
+			vmi, err := virtClient.VirtualMachineInstance(namespace).Get(chosenVMI.Name, &metav1.GetOptions{})
+
+			g.Expect(err).ShouldNot(HaveOccurred())
+			g.Expect(vmi).ShouldNot(BeNil())
+
+			for _, ifs := range vmi.Status.Interfaces {
+				for _, ip := range ifs.IPs {
+					if ip == nodeIp {
+						return true
+					}
+				}
+			}
+			return false
+
+		}).WithTimeout(5 * time.Minute).WithPolling(10 * time.Second).Should(BeTrue())


i think it would make sense to move this to a helper function

davidvossel · 2022-08-01T14:16:20Z

+	nodeDrained, retryDuration, err := r.drainNode(ctx, cluster, vmi.Status.EvacuationNodeName, logger)
+	if err != nil {
+		return ctrl.Result{RequeueAfter: retryDuration}, err
+	}
+
+	if !nodeDrained && r.waitingForTimeout(ctx, vmi, logger) {
+		return ctrl.Result{RequeueAfter: retryDuration}, nil
+	}


The drain can block indefinitely due to the way this logic is structured. The timeout needs to be checked either regardless if the drainNode() function returns an error or before drainNode is even called. If the global timeout occurs, the VMI should be deleted.

We don't want anything within the guest cluster to indefinitely block an infra cluster node from being drained.

When a host node is deleted, the guest VMIs are evicted, killing the running tenant node in this VMI. This PR gracefully delete the VMI, by first drain the guest node. Only if the drain successful, CAPK will delete the VMI. The PR is based on a feature in KubeVirt, where by setting the eviction strategy to "external" KubeVirt will not delete the VMI, but will set the `vmi.Status.EvacuationNodeName` field to signal to the external controller (CAPK in this case) that the VMI should be evicted. Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>

Also, add unit tests. Needed to changed the logic to get the guest cluster client, to use the workloadClient type, because the previous code was not testable. Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>

If drain is failing for more than 10 minutes, delete the VMI anyway. Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>

Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>

davidvossel

/lgtm
/approve

excellent work 👍

k8s-ci-robot · 2022-08-03T14:40:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel, nunnatsa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [davidvossel,nunnatsa]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 30, 2022

k8s-ci-robot requested review from justinsb and rmohr June 30, 2022 07:06

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 30, 2022

nunnatsa force-pushed the drain-node branch 2 times, most recently from 0658eb0 to 14fabcf Compare June 30, 2022 07:47

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 30, 2022

nunnatsa force-pushed the drain-node branch from 14fabcf to 649f260 Compare June 30, 2022 12:06

davidvossel reviewed Jun 30, 2022

View reviewed changes

nirarg reviewed Jul 3, 2022

View reviewed changes

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 4, 2022

nunnatsa force-pushed the drain-node branch from 90c6ea2 to ba199f5 Compare July 4, 2022 08:01

davidvossel reviewed Jul 6, 2022

View reviewed changes

k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jul 12, 2022

nunnatsa force-pushed the drain-node branch 10 times, most recently from cf1e68c to 61f8a47 Compare July 13, 2022 13:56

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2022

nunnatsa force-pushed the drain-node branch from 56b11ce to c85f2f3 Compare July 26, 2022 11:27

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2022

nunnatsa force-pushed the drain-node branch 8 times, most recently from 24ff909 to a1ec316 Compare July 31, 2022 10:54

nunnatsa changed the title ~~WIP: Graceful deletion of a VirtualMachineInstance~~ Graceful deletion of a VirtualMachineInstance Jul 31, 2022

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 31, 2022

davidvossel reviewed Aug 1, 2022

View reviewed changes

nunnatsa added 5 commits August 2, 2022 12:12

Apply code-review comments

b9d14c9

Also, add unit tests. Needed to changed the logic to get the guest cluster client, to use the workloadClient type, because the previous code was not testable. Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>

Drain node: add timeout

210fd52

If drain is failing for more than 10 minutes, delete the VMI anyway. Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>

Add VMI eviction functional test

d8b2b98

Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>

fix bug in the vmi eviction controller - wrong predicate

8c89b7c

Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>

nunnatsa force-pushed the drain-node branch 4 times, most recently from e7dfa69 to b1e5a0b Compare August 2, 2022 19:29

Fix code review comments

8a5e0df

Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>

nunnatsa force-pushed the drain-node branch from b1e5a0b to 8a5e0df Compare August 2, 2022 20:28

davidvossel approved these changes Aug 3, 2022

View reviewed changes

k8s-ci-robot assigned davidvossel Aug 3, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 3, 2022

tiraboschi mentioned this pull request Sep 5, 2025

OCPBUGS-58397: feat(KubeVirt): configure External evictionStrategy on VMs openshift/hypershift#6380

Merged

4 tasks

Conversation

nunnatsa commented Jun 30, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Known Issues

Uh oh!

coveralls commented Jun 30, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 2785024320

💛 - Coveralls

Uh oh!

davidvossel left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidvossel left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nunnatsa commented Jul 12, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidvossel left a comment

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented Aug 3, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

nunnatsa commented Jun 30, 2022 •

edited

Loading

coveralls commented Jun 30, 2022 •

edited

Loading