Restart external-attacher faster by releasing leader election lease on sigterm#633
Conversation
|
Hi @rhrmo. Thanks for your PR. I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
|
I tested in on a cluster. What's great is that when the attacher gets SIGTERM, it exits immediately (this PR does not change that) + release its lease, so a newly started attacher starts immediately. This is taken when SIGTERM was received in the middle of ControllerPublish: What's not so great is that the CSI driver does not know about any of this. It got This could happen also today, however, before this PR, the new attacher would need to wait for the old attacher lease to expire (15 seconds in the default config). Which could give the old CSI driver enough time either complete the attachment or die (because it's unlikely that the attacher got SIGTERM and the CSI driver in the same Pod did not). |
Instead of shutting down immediately on a sigterm, can we wait until all ongoing requests are finished? It would probably require changes to our reconciler. |
We would need two contexts then and it might get messy which context should be used when. @rhrmo can you explore this idea? I think the only place that should unblock when a signal is received would be here (to call And here to finish periodic reconcile: Maybe you find other places. But it will be different in each sidecar. |
|
Sure thing! |
f60ce7f to
e45a1ac
Compare
166b06c to
ff3a93f
Compare
I can see there is a code that waits for workers to finish before exiting, still, they should finish with a lot of |
ff3a93f to
241758b
Compare
16eaf56 to
9ab7f74
Compare
e431b7f to
f718411
Compare
f718411 to
e8b0aac
Compare
|
/retest-required |
a978bb8 to
2112285
Compare
| if utilfeature.DefaultFeatureGate.Enabled(features.ReleaseLeaderElectionOnExit) { | ||
| var wg sync.WaitGroup | ||
| stopCh := shutdownHandler | ||
| factory.Start(stopCh) | ||
| ctrl.Run(controllerCtx, int(*workerThreads), &wg) | ||
| if signalReceived { | ||
| wg.Wait() | ||
| terminate() | ||
| } |
There was a problem hiding this comment.
Just adding few comments for my future self:
| if utilfeature.DefaultFeatureGate.Enabled(features.ReleaseLeaderElectionOnExit) { | |
| var wg sync.WaitGroup | |
| stopCh := shutdownHandler | |
| factory.Start(stopCh) | |
| ctrl.Run(controllerCtx, int(*workerThreads), &wg) | |
| if signalReceived { | |
| wg.Wait() | |
| terminate() | |
| } | |
| if utilfeature.DefaultFeatureGate.Enabled(features.ReleaseLeaderElectionOnExit) { | |
| var wg sync.WaitGroup | |
| stopCh := shutdownHandler | |
| factory.Start(stopCh) | |
| ctrl.Run(controllerCtx, int(*workerThreads), &wg) | |
| if signalReceived { | |
| // Wait for the controllers to finish | |
| wg.Wait() | |
| // Finish the overall context and release leader election. | |
| terminate() | |
| } |
2112285 to
0f82cdb
Compare
| stopCh := shutdownHandler | ||
| factory.Start(stopCh) | ||
| ctrl.Run(controllerCtx, int(*workerThreads), &wg) | ||
| if signalReceived { |
There was a problem hiding this comment.
Can Run() above finish without a signal received? Assuming it could, what would be the right handling? IMO releasing the leader election + exit (i.e. terminate()) would be correct in all cases where Run() finishes.
There was a problem hiding this comment.
I.e. IMO we could remove the whole variable signalReceived
There was a problem hiding this comment.
I think Run() should be able to finish without signal but I am not entirely sure now that you ask, I will test it. I agree with releasing the leader election + exit after Run() finishes.
There was a problem hiding this comment.
I removed the whole signalReceived variable, testing Run() is bit tricky though because it waits for the context to be cancelled before it returns, by trying to call return at it's end instead <-ctx.Done() it finished without any issues.
0f82cdb to
7d82cda
Compare
7d82cda to
40429ec
Compare
40429ec to
7aaec36
Compare
|
/lgtm |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jsafrane, rhrmo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR makes external-attacher restart faster when it is terminating because of sigterm. This is accomplished by releasing leader election lease on sigterm and then it does not have to wait for lease to expire but it proceeds with restart right away. This is the first PR of series of PRs that fix #609
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This PR is the first of the series of PRs that fix #609 and I want to use it to agree on a solution - if you have any suggestions, or find anything wrong, please add a comment and let me know.
I had to import newer version of csi-lib-utils library, so that it would contain changes from kubernetes-csi/csi-lib-utils#191 (go get github.com/kubernetes-csi/csi-lib-utils@f77445f75e9c4954062cdddcef4863adab21947b), so before merging this we would also need to have new version of csi-lib-utils released.
Does this PR introduce a user-facing change?: