Implementing Graceful Shutdown for VMs #101
Conversation
@lukasfrank If I remember correctly, you implemented an exponential requeue for every returned error. Is that correct? If so, we can simplify the code and remove the requeue error and the custom requeue interval.
Correct - the requeuing is already implemented in the `RateLimitingQueue`.
I acknowledge that the requeuing functionality is already incorporated in the `RateLimitingQueue`. However, I find the nanosecond precision too rapid, especially in the context of virtual machine shutdowns (refer to: source). When utilizing the `RateLimitingQueue` for VM shutdowns, the heightened precision leads to accelerated reconciliations, resulting in an unnecessary surge in the number of machine reconciliations. Therefore, I propose keeping the existing custom requeue interval for the graceful shutdown of VMs.
@so-sahu I think you mixed something up: The delay is only converted to nanoseconds. We use the following (default) settings:
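For reference, a minimal sketch of what those defaults look like, assuming the queue falls back to client-go's `workqueue.DefaultControllerRateLimiter` (the exact wiring in this project may differ):

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Assumption: the controller uses client-go's default controller rate limiter.
	// It combines a per-item exponential backoff (5ms base, 1000s cap) with an
	// overall token bucket (10 qps, burst 100). The delay is merely *expressed*
	// in nanoseconds (time.Duration); it does not requeue every nanosecond.
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	// The backoff doubles per failure of the same item: 5ms, 10ms, 20ms, ...
	item := "machine-a"
	for i := 0; i < 5; i++ {
		fmt.Println(limiter.When(item))
	}
}
```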
Hi @lukasfrank and @so-sahu, I made a mistake when I raised this as a review comment. I created a conversation to continue the discussion about this problem (#101 (comment)), because using standard comments hurts the readability of the PR.
As part of an offline discussion (@so-sahu @lukas016) we concluded that the current approach misuses the reconcile loop to delay the deletion process, which leads to odd and unstable behaviour.
The agreed second iteration behaves as follows:
- The server adds the deletion timestamp (already there)
- An additional routine periodically checks (`store.List`) whether machines in the store contain a deletion timestamp
- If the deletion timestamp + shutdown period is exceeded, the machine is shut down forcefully
- Otherwise the VM receives a graceful shutdown signal
- Machine-dependent resources are deleted and the finalizer is removed

To summarise: the shutdown handling is factored out of the main routine, and the main routine stays almost the same.
Instead of requeuing the `Machine` when we shut down the domain, I would rather implement the following termination flow (see the sketch below):
- When we delete a `Machine`, we add the `deletionTimestamp` to the `Machine`'s metadata and initiate a domain shutdown
- In parallel we start in `app.go` a GC go routine (i.e. a garbage collection process), configurable via a flag (e.g. `--gc-resync-period`), which runs over all `Machine`s in the local store and checks whether `deletionTimestamp + GracePeriod` already lies before `currentTime`. If that is the case we forcefully shut down the domain.
I would not reuse the requeuing mechanism in the reconciler for doing cleanup jobs.
UPDATE: Sorry, I didn't see @lukasfrank's comment. But essentially this goes in the same direction.
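A minimal sketch of such a GC routine, purely as an illustration: `Machine`, `MachineStore`, and the shutdown helpers below are hypothetical placeholders, not this project's actual types or API.

```go
package machinegc

import (
	"context"
	"time"
)

// Hypothetical placeholders; the project's actual types and helpers differ.
type Machine struct {
	ID                string
	DeletionTimestamp *time.Time
}

type MachineStore interface {
	List(ctx context.Context) ([]*Machine, error)
}

func gracefulShutdownDomain(ctx context.Context, m *Machine) { /* send graceful shutdown signal */ }
func forceShutdownDomain(ctx context.Context, m *Machine)    { /* destroy the domain */ }

// runMachineGC periodically scans the store and escalates to a forceful
// shutdown once the grace period after the deletion timestamp has elapsed.
func runMachineGC(ctx context.Context, store MachineStore, resyncPeriod, gracePeriod time.Duration) {
	ticker := time.NewTicker(resyncPeriod) // e.g. driven by --gc-resync-period
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			machines, err := store.List(ctx)
			if err != nil {
				continue // transient store error, retry on the next tick
			}
			for _, m := range machines {
				if m.DeletionTimestamp == nil {
					continue // not being deleted
				}
				if time.Since(*m.DeletionTimestamp) > gracePeriod {
					forceShutdownDomain(ctx, m) // grace period exceeded
				} else {
					gracefulShutdownDomain(ctx, m) // still within grace period
				}
			}
		}
	}
}
```

Deletion of machine-dependent resources and removal of the finalizer would then stay in the regular deletion path, keeping the main reconcile routine almost unchanged.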
LGTM, I am waiting for the rebase before approval.
Force-pushed from eb02b69 to bbb98ec
Force-pushed from 766ff69 to 86467c5
We might also want to add a test case ensuring the correct behaviour, e.g. start the service with a relatively short grace period.
Since the integration tests are currently in progress, it would be nice to include these tests there to prevent any potential conflicts.
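A rough sketch of what such a test could look like: `startMachineService`, `withGracePeriod`, `withGCResyncPeriod`, `createMachine`, `deleteMachine`, and `machineGone` are hypothetical helpers standing in for whatever the integration test suite provides.

```go
package integration

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// Hypothetical integration test: all helpers used here are placeholders.
func TestForcefulShutdownAfterGracePeriod(t *testing.T) {
	ctx := context.Background()

	// Start the service with a deliberately short grace period and GC resync
	// interval so the garbage collection routine kicks in quickly.
	srv := startMachineService(t, withGracePeriod(2*time.Second), withGCResyncPeriod(500*time.Millisecond))
	defer srv.Stop()

	m := createMachine(t, ctx, srv)
	deleteMachine(t, ctx, srv, m)

	// The VM should first receive a graceful shutdown signal and, once the
	// grace period has elapsed, be shut down forcefully and cleaned up.
	require.Eventually(t, func() bool {
		return machineGone(ctx, srv, m)
	}, 10*time.Second, 250*time.Millisecond)
}
```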
Signed-off-by: Lukas Frank <[email protected]>
Force-pushed from 1e4e38f to e3e4e19
Fixes #73