
Implementing Graceful Shutdown for VMs #101

Merged: 9 commits merged into main from feature/VM_graceful_shutdown on Feb 5, 2024

Conversation

@so-sahu (Contributor) commented Jan 5, 2024

Fixes #73

@so-sahu so-sahu self-assigned this Jan 5, 2024
@so-sahu so-sahu requested a review from a team as a code owner January 5, 2024 06:04
@github-actions github-actions bot added the size/L, controllers, and enhancement (New feature or request) labels Jan 5, 2024
@lukas016 (Contributor) commented Jan 5, 2024

@lukasfrank if I remember correctly, you implemented exponential requeue for every returned error. Is that correct? If so, we can simplify the code and delete the requeue error and the custom requeue interval.

@lukasfrank (Member) commented:

Correct - the requeuing is already implemented in the RateLimitingQueue (k8s.io/utils/clock)

@so-sahu (Contributor, Author) commented Jan 10, 2024

I acknowledge that the requeuing functionality is already incorporated in the RateLimitingQueue. However, I find the nanosecond precision too rapid, especially in the context of virtual machine shutdowns (refer to: source). When using the RateLimitingQueue for VM shutdowns, the heightened precision leads to accelerated reconciliations, resulting in an unnecessary surge in the number of machine reconciliations.

Therefore, I propose keeping the existing custom requeue interval for the graceful shutdown of VMs.

@lukasfrank (Member) commented:

@so-sahu I think you mixed something up: The delay is only converted to nanoseconds. We use the following (default) settings:

func DefaultControllerRateLimiter() RateLimiter {
	return NewMaxOfRateLimiter(
		NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		// 10 qps, 100 bucket size.  This is only for retry speed and its only the overall factor (not per item)
		&BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}
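
For context, a minimal standalone sketch (not part of this PR), assuming the k8s.io/client-go/util/workqueue package these defaults come from, that prints the per-item delays produced by the exponential failure rate limiter above. The delays start at 5 ms and double with every failed retry of the same item, capped at 1000 s; they are merely expressed as time.Duration values, which is where the nanoseconds come from.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Same per-item settings as in DefaultControllerRateLimiter:
	// base delay 5 ms, capped at 1000 s.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	// Every additional failure of the same item doubles its requeue delay:
	// 5ms 10ms 20ms 40ms 80ms ...
	for i := 0; i < 5; i++ {
		fmt.Println(limiter.When("machine-a"))
	}
}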

@lukas016 (Contributor) commented:

Hi @lukasfrank and @so-sahu, I made a mistake by raising this as a regular comment. I created a conversation to continue the discussion about this problem (#101 (comment)), because using standard comments hurts the readability of the PR.

@lukasfrank (Member) left a comment

As part of an offline discussion (@so-sahu @lukas016) we concluded that the current approach misuses the reconcile loop to delay the deletion process, which leads to odd and unstable behaviour.

The agreed second iteration behaves the following way:

  • Server adds the deletion timestamp (already there)
  • An additional routine periodically checks (store.List) whether machines in the store carry a deletion timestamp
    • If timestamp + shutdown period is exceeded, the machine is shut down forcefully
    • Otherwise the VM receives a graceful shutdown signal
  • Machine-dependent resources are deleted and the finalizer is removed

To summarise: the shutdown handling is factored out of the main routine, and the main routine stays almost the same
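
A minimal sketch of how such a shutdown routine could look, not taken from the PR; the Machine, MachineStore, gracefulShutdown and forceShutdown names are placeholders for whatever the provider actually uses, and the grace period is assumed to be a configured duration added to the deletion timestamp.

package provider

import (
	"context"
	"time"
)

// Placeholder types, not the provider's real API.
type Machine struct {
	ID                string
	DeletionTimestamp *time.Time
}

type MachineStore interface {
	List(ctx context.Context) ([]*Machine, error)
}

// shouldForceShutdown reports whether the graceful shutdown period of a
// machine marked for deletion has already elapsed.
func shouldForceShutdown(m *Machine, gracePeriod time.Duration, now time.Time) bool {
	return m.DeletionTimestamp != nil && now.After(m.DeletionTimestamp.Add(gracePeriod))
}

// runShutdownLoop periodically lists machines from the store and either sends
// a graceful shutdown signal or forces the shutdown once the grace period is
// exceeded. Deleting dependent resources and removing the finalizer would
// follow the shutdown in the real implementation.
func runShutdownLoop(
	ctx context.Context,
	store MachineStore,
	resyncPeriod, gracePeriod time.Duration,
	gracefulShutdown, forceShutdown func(ctx context.Context, m *Machine) error,
) {
	ticker := time.NewTicker(resyncPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			machines, err := store.List(ctx)
			if err != nil {
				continue // the next tick retries
			}
			for _, m := range machines {
				if m.DeletionTimestamp == nil {
					continue // machine is not being deleted
				}
				if shouldForceShutdown(m, gracePeriod, time.Now()) {
					_ = forceShutdown(ctx, m) // grace period exceeded
				} else {
					_ = gracefulShutdown(ctx, m) // send graceful shutdown signal
				}
			}
		}
	}
}

Keeping the deadline check in a small helper like shouldForceShutdown also makes the force-vs-graceful decision easy to unit test, independently of the main reconcile loop.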

@afritzler (Member) left a comment

Instead of requeuing the Machine when we shut down the domain, I would rather implement the following termination flow:

  • When we delete a Machine, we add the deletionTimestamp to the Machine's metadata and initiate a domain shutdown
  • In parallel, we start a GC goroutine (a garbage-collection process) in app.go which, configurable via a flag (e.g. --gc-resync-period), runs over all Machines in the local store and checks whether currentTime has passed deletionTimestamp + GracePeriod. If that is the case, we forcefully shut down the domain.

I would not reuse the requeuing mechanism in the reconciler for doing cleanup jobs.

UPDATE: sorry, I didn't see @lukasfrank's comment. But essentially this goes in the same direction.
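
A rough sketch of how such a GC goroutine could be wired up behind a --gc-resync-period flag; it uses the standard library flag package and placeholder flag names and defaults, so the provider's real app.go will look different.

package main

import (
	"context"
	"flag"
	"log"
	"time"
)

func main() {
	// Hypothetical flags; the real app.go may use a different flag library,
	// names and defaults.
	gcResyncPeriod := flag.Duration("gc-resync-period", 1*time.Minute,
		"interval at which the GC routine scans the local store for Machines marked for deletion")
	gracePeriod := flag.Duration("shutdown-grace-period", 2*time.Minute,
		"time a Machine is given to shut down gracefully before the domain is powered off")
	flag.Parse()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// GC goroutine started in parallel to the reconciler instead of
	// requeuing Machines for cleanup.
	go func() {
		ticker := time.NewTicker(*gcResyncPeriod)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// The real implementation would iterate over all Machines in
				// the local store and forcefully shut down every domain whose
				// deletionTimestamp + gracePeriod lies in the past.
				log.Printf("gc tick, grace period %s", *gracePeriod)
			}
		}
	}()

	// ... start the server / reconcilers as before ...
	select {} // block forever in this sketch
}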

@lukasfrank lukasfrank added the integration-tests (to run integration tests) label Jan 19, 2024
@so-sahu so-sahu requested a review from lukas016 January 22, 2024 13:47
@lukas016 (Contributor) left a comment

LGTM, I am waiting for the rebase before approval.

@so-sahu so-sahu force-pushed the feature/VM_graceful_shutdown branch from eb02b69 to bbb98ec on January 24, 2024 09:17
@so-sahu so-sahu requested a review from lukas016 January 24, 2024 09:21
@lukas016 lukas016 previously approved these changes Jan 24, 2024
@lukas016 lukas016 mentioned this pull request Jan 25, 2024
@lukas016 lukas016 previously approved these changes Feb 1, 2024
@lukasfrank lukasfrank previously approved these changes Feb 1, 2024
@afritzler (Member) left a comment

We might also want to add a test case ensuring the correct behavior, e.g. starting the service with a relatively short grace period.

@so-sahu (Contributor, Author) commented Feb 1, 2024


Since the integration tests are currently in progress, it would be nice to include these tests there to prevent any potential conflicts.
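
As an illustration of that suggestion, a hypothetical test with a deliberately short grace period, written against the shouldForceShutdown placeholder from the sketch further up (same assumed package, not the provider's real API):

package provider

import (
	"testing"
	"time"
)

func TestForcedShutdownAfterShortGracePeriod(t *testing.T) {
	gracePeriod := 100 * time.Millisecond

	// The machine was marked for deletion one second ago, so the short
	// grace period has already elapsed and a forced shutdown is expected.
	deleted := time.Now().Add(-1 * time.Second)
	m := &Machine{ID: "machine-a", DeletionTimestamp: &deleted}

	if !shouldForceShutdown(m, gracePeriod, time.Now()) {
		t.Fatalf("expected machine %s to be shut down forcefully after the grace period elapsed", m.ID)
	}
}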

@so-sahu so-sahu dismissed stale reviews from lukasfrank and lukas016 via 1e4e38f February 2, 2024 08:41
@so-sahu so-sahu force-pushed the feature/VM_graceful_shutdown branch from 1e4e38f to e3e4e19 on February 2, 2024 08:42
@lukasfrank lukasfrank requested a review from afritzler February 5, 2024 07:37
@lukasfrank lukasfrank merged commit 593a887 into main Feb 5, 2024
8 checks passed
@lukasfrank lukasfrank deleted the feature/VM_graceful_shutdown branch February 5, 2024 08:01
Labels: controllers, enhancement (New feature or request), integration-tests (to run integration tests), size/L
Projects: None yet
Development: Successfully merging this pull request may close these issues: Implement Graceful Shutdown for VMs
4 participants