
Implementing Graceful Shutdown for VMs #101

Merged: 9 commits merged into main from feature/VM_graceful_shutdown on Feb 5, 2024

Conversation

@so-sahu (Contributor) commented Jan 5, 2024

Fixes #73

@so-sahu so-sahu self-assigned this Jan 5, 2024
@so-sahu so-sahu requested a review from a team as a code owner January 5, 2024 06:04
@github-actions github-actions bot added the size/L, controllers, and enhancement (New feature or request) labels Jan 5, 2024
@lukas016 (Contributor) commented Jan 5, 2024

@lukasfrank if I remember correctly, you implemented exponential requeue for every returned error. Is that correct? If so, we can simplify the code and delete the requeue error and the custom requeue interval.

@lukasfrank (Member) commented:

Correct - the requeuing is already implemented in the RateLimitingQueue (k8s.io/utils/clock)

@so-sahu (Contributor, Author) commented Jan 10, 2024

I acknowledge that the requeuing functionality is already incorporated in the RateLimitingQueue. However, I find the nanosecond precision too rapid, especially in the context of virtual machine shutdowns (refer to: source). When using the RateLimitingQueue for VM shutdowns, the heightened precision leads to accelerated reconciliations, resulting in an unnecessary surge in the number of machine reconciliations.

Therefore, I propose keeping the existing custom requeue interval for the graceful shutdown of VMs.

@lukasfrank (Member) commented:

@so-sahu I think you mixed something up: The delay is only converted to nanoseconds. We use the following (default) settings:

func DefaultControllerRateLimiter() RateLimiter {
	return NewMaxOfRateLimiter(
		NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		// 10 qps, 100 bucket size.  This is only for retry speed and its only the overall factor (not per item)
		&BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}
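
For context, a minimal standalone sketch (not part of this PR), assuming the k8s.io/client-go/util/workqueue package these defaults come from, that prints the per-item delays produced by the exponential failure rate limiter above. The delays start at 5 ms and double with every failed retry of the same item, capped at 1000 s; they are merely expressed as time.Duration values, which is where the nanoseconds come from.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Same per-item settings as in DefaultControllerRateLimiter:
	// base delay 5 ms, capped at 1000 s.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	// Every additional failure of the same item doubles its requeue delay:
	// 5ms 10ms 20ms 40ms 80ms ...
	for i := 0; i < 5; i++ {
		fmt.Println(limiter.When("machine-a"))
	}
}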

@lukas016 (Contributor) commented:

Hi @lukasfrank and @so-sahu, I made a mistake by raising this as a regular comment. I created a conversation to continue the discussion about this problem (#101 (comment)), because using standard comments hurts the readability of the PR.

@lukasfrank (Member) left a comment

As part of an offline discussion (@so-sahu @lukas016) we concluded that the current approach misuses the reconcile loop to delay the deletion process, which leads to odd and unstable behaviour.

The agreed second iteration behaves the following way:

  • Server adds the deletion timestamp (already there)
  • An additional routine periodically checks (store.List) whether machines in the store carry a deletion timestamp
    • If timestamp + shutdown period is exceeded, the machine is shut down forcefully
    • Otherwise the VM receives a graceful shutdown signal
  • Machine-dependent resources are deleted and the finalizer is removed

To summarise: the shutdown handling is factored out of the main routine, and the main routine stays almost the same
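
A minimal sketch of how such a shutdown routine could look, not taken from the PR; the Machine, MachineStore, gracefulShutdown and forceShutdown names are placeholders for whatever the provider actually uses, and the grace period is assumed to be a configured duration added to the deletion timestamp.

package provider

import (
	"context"
	"time"
)

// Placeholder types, not the provider's real API.
type Machine struct {
	ID                string
	DeletionTimestamp *time.Time
}

type MachineStore interface {
	List(ctx context.Context) ([]*Machine, error)
}

// shouldForceShutdown reports whether the graceful shutdown period of a
// machine marked for deletion has already elapsed.
func shouldForceShutdown(m *Machine, gracePeriod time.Duration, now time.Time) bool {
	return m.DeletionTimestamp != nil && now.After(m.DeletionTimestamp.Add(gracePeriod))
}

// runShutdownLoop periodically lists machines from the store and either sends
// a graceful shutdown signal or forces the shutdown once the grace period is
// exceeded. Deleting dependent resources and removing the finalizer would
// follow the shutdown in the real implementation.
func runShutdownLoop(
	ctx context.Context,
	store MachineStore,
	resyncPeriod, gracePeriod time.Duration,
	gracefulShutdown, forceShutdown func(ctx context.Context, m *Machine) error,
) {
	ticker := time.NewTicker(resyncPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			machines, err := store.List(ctx)
			if err != nil {
				continue // the next tick retries
			}
			for _, m := range machines {
				if m.DeletionTimestamp == nil {
					continue // machine is not being deleted
				}
				if shouldForceShutdown(m, gracePeriod, time.Now()) {
					_ = forceShutdown(ctx, m) // grace period exceeded
				} else {
					_ = gracefulShutdown(ctx, m) // send graceful shutdown signal
				}
			}
		}
	}
}

Keeping the deadline check in a small helper like shouldForceShutdown also makes the force-vs-graceful decision easy to unit test, independently of the main reconcile loop.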

@afritzler (Member) left a comment

Instead of requeuing the Machine when we shut down the domain, I would rather implement the following termination flow:

  • When we delete a Machine, we add the deletionTimestamp to the Machine's metadata and initiate a domain shutdown
  • In parallel, we start a GC goroutine (a garbage-collection process) in app.go which, configurable via a flag (e.g. --gc-resync-period), runs over all Machines in the local store and checks whether currentTime has passed deletionTimestamp + GracePeriod. If that is the case, we forcefully shut down the domain.

I would not reuse the requeuing mechanism in the reconciler for doing cleanup jobs.

UPDATE: sorry, I didn't see @lukasfrank's comment. But essentially this goes in the same direction.
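
A rough sketch of how such a GC goroutine could be wired up behind a --gc-resync-period flag; it uses the standard library flag package and placeholder flag names and defaults, so the provider's real app.go will look different.

package main

import (
	"context"
	"flag"
	"log"
	"time"
)

func main() {
	// Hypothetical flags; the real app.go may use a different flag library,
	// names and defaults.
	gcResyncPeriod := flag.Duration("gc-resync-period", 1*time.Minute,
		"interval at which the GC routine scans the local store for Machines marked for deletion")
	gracePeriod := flag.Duration("shutdown-grace-period", 2*time.Minute,
		"time a Machine is given to shut down gracefully before the domain is powered off")
	flag.Parse()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// GC goroutine started in parallel to the reconciler instead of
	// requeuing Machines for cleanup.
	go func() {
		ticker := time.NewTicker(*gcResyncPeriod)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// The real implementation would iterate over all Machines in
				// the local store and forcefully shut down every domain whose
				// deletionTimestamp + gracePeriod lies in the past.
				log.Printf("gc tick, grace period %s", *gracePeriod)
			}
		}
	}()

	// ... start the server / reconcilers as before ...
	select {} // block forever in this sketch
}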

@lukasfrank lukasfrank added the integration-tests (to run integration tests) label Jan 19, 2024
@so-sahu so-sahu requested a review from lukas016 January 22, 2024 13:47
@lukas016 (Contributor) left a comment

LGTM, I am waiting for the rebase before approval.

@so-sahu so-sahu force-pushed the feature/VM_graceful_shutdown branch from eb02b69 to bbb98ec on January 24, 2024 09:17
@so-sahu so-sahu requested a review from lukas016 January 24, 2024 09:21
@lukas016 lukas016 previously approved these changes Jan 24, 2024
@lukas016 lukas016 mentioned this pull request Jan 25, 2024
@lukas016 lukas016 previously approved these changes Feb 1, 2024
@lukasfrank lukasfrank previously approved these changes Feb 1, 2024
@afritzler (Member) left a comment

We might also want to add a test case ensuring the correct behavior, e.g. starting the service with a relatively short grace period.

@so-sahu (Contributor, Author) commented Feb 1, 2024


Since the integration tests are currently in progress, it would be nice to include these tests there to prevent any potential conflicts.
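
As an illustration of that suggestion, a hypothetical test with a deliberately short grace period, written against the shouldForceShutdown placeholder from the sketch further up (same assumed package, not the provider's real API):

package provider

import (
	"testing"
	"time"
)

func TestForcedShutdownAfterShortGracePeriod(t *testing.T) {
	gracePeriod := 100 * time.Millisecond

	// The machine was marked for deletion one second ago, so the short
	// grace period has already elapsed and a forced shutdown is expected.
	deleted := time.Now().Add(-1 * time.Second)
	m := &Machine{ID: "machine-a", DeletionTimestamp: &deleted}

	if !shouldForceShutdown(m, gracePeriod, time.Now()) {
		t.Fatalf("expected machine %s to be shut down forcefully after the grace period elapsed", m.ID)
	}
}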

@so-sahu so-sahu dismissed stale reviews from lukasfrank and lukas016 via 1e4e38f February 2, 2024 08:41
@so-sahu so-sahu force-pushed the feature/VM_graceful_shutdown branch from 1e4e38f to e3e4e19 on February 2, 2024 08:42
@lukasfrank lukasfrank requested a review from afritzler February 5, 2024 07:37
@lukasfrank lukasfrank merged commit 593a887 into main Feb 5, 2024
8 checks passed
@lukasfrank lukasfrank deleted the feature/VM_graceful_shutdown branch February 5, 2024 08:01
Labels: controllers, enhancement (New feature or request), integration-tests (to run integration tests), size/L
Projects: None yet
Development: Successfully merging this pull request may close these issues: Implement Graceful Shutdown for VMs
4 participants