Bug 1858403: Use client-go leader election to write less. #231
Conversation
@dgoodwin: This pull request references Bugzilla bug 1858403, which is invalid.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
staebler
left a comment
/lgtm
/hold Please see my comments on the BZ; this is not the right way to solve this problem (you should release the lock on shutdown).
@dgoodwin: This pull request references Bugzilla bug 1858403, which is invalid.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
PR updated: it now uses client-go leader election directly so we can get access to ReleaseOnCancel, which is not currently exposed in controller-runtime. It now generates UUIDs for leader election IDs; previously it used a flat string, which seems like it should have been broken, but somehow UUIDs were getting appended in the actual ConfigMap. Tested bringing up multiple processes and switching between them. Provided we exit cleanly, there is just a couple of seconds of startup delay when the new process takes over, or when restarting a single process.
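(For illustration, a minimal sketch of how a unique per-process leader election identity could be generated. It assumes github.com/google/uuid and logrus; the helper name and exact format are hypothetical, not taken from the PR.)

```go
package main

import (
	"fmt"
	"os"

	"github.com/google/uuid"
	log "github.com/sirupsen/logrus"
)

// newLeaderElectionID returns an identity unique to this process so that
// concurrent operator pods can be distinguished in the lock's holder record.
// Hypothetical helper; the PR may build the id differently.
func newLeaderElectionID() string {
	hostname, err := os.Hostname()
	if err != nil {
		hostname = "unknown"
	}
	return fmt.Sprintf("%s_%s", hostname, uuid.New().String())
}

func main() {
	id := newLeaderElectionID()
	log.WithField("id", id).Info("generated leader election ID")
}
```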
/bugzilla refresh
@dgoodwin: This pull request references Bugzilla bug 1858403, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
staebler
left a comment
A nice improvement!
pkg/cmd/operator/cmd.go
```go
	if err := controller.AddToManager(mgr, kubeconfigCommandLinePath); err != nil {
		log.WithError(err).Fatal("unable to register controllers to the manager")
	}

	// Leader election code based on:
	// https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/examples/leader-election/main.go#L130
```
Make this a permalink, and cover the entire section that you are basing the code on.
https://github.com/kubernetes/kubernetes/blob/f7e3bcdec2e090b7361a61e21c20b3dbbb41b7f0/staging/src/k8s.io/client-go/examples/leader-election/main.go#L92-L154
Fixed.
```go
	leLog := log.WithField("id", id)
	leLog.Info("generated leader election ID")

	lock := &resourcelock.ConfigMapLock{
```
Is this a good time to switch to using a LeaseLock instead of a ConfigMapLock? Or is this more complicated due to needing to upgrade existing CCOs that would still be using a ConfigMapLock?
I wasn't totally aware of how LeaseLock works; it doesn't appear to be a resource in-cluster, at least in 4.5 where I was running this. There is potential for some problems during upgrade though, for sure. I'd vote to stay consistent for now.
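(For reference, a rough sketch of how the two lock flavours are constructed with client-go's resourcelock package; the namespace and name values below are placeholders, not the operator's actual lock names.)

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// buildLocks shows the two lock types side by side. ConfigMapLock stores the
// leader record on a ConfigMap (what the operator already uses); LeaseLock
// uses the coordination.k8s.io Lease resource instead. Switching types during
// an upgrade risks old and new pods holding different locks, which is the
// upgrade concern mentioned above.
func buildLocks(client kubernetes.Interface, id string) (resourcelock.Interface, resourcelock.Interface) {
	cmLock := &resourcelock.ConfigMapLock{
		ConfigMapMeta: metav1.ObjectMeta{
			Namespace: "openshift-cloud-credential-operator", // placeholder
			Name:      "cloud-credential-operator-leader",    // placeholder
		},
		Client:     client.CoreV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaseLock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "openshift-cloud-credential-operator", // placeholder
			Name:      "cloud-credential-operator-leader",    // placeholder
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	return cmLock, leaseLock
}
```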
We were using the defaults from controller-runtime previously:
lease duration: 15s
renew deadline: 10s
retry period: 2s
This meant that the active leader was writing to etcd every 2 seconds to update the lease, which is excessive writing and spawned the bug above.
We now implement leader election using the underlying client-go code to get access to ReleaseOnCancel, which is not presently exposed in controller-runtime. This allows us to immediately release the lock on normal shutdown, eliminating delay before another pod takes over, as well as startup delay when doing development etc.
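(Roughly, the client-go wiring described above looks like the sketch below. ReleaseOnCancel is the piece controller-runtime does not expose; the timing values and callback bodies here are illustrative assumptions, not necessarily the settings the PR adopted.)

```go
package sketch

import (
	"context"
	"time"

	log "github.com/sirupsen/logrus"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection blocks until ctx is cancelled. Because ReleaseOnCancel
// is true, a clean shutdown clears the lock record so a replacement pod can
// take over immediately instead of waiting out the remaining lease, and the
// longer durations mean the leader renews (writes to etcd) far less often
// than the old 2s retry period.
func runWithLeaderElection(ctx context.Context, lock resourcelock.Interface, start func(context.Context)) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   120 * time.Second, // illustrative values only
		RenewDeadline:   90 * time.Second,
		RetryPeriod:     30 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: start,
			OnStoppedLeading: func() {
				log.Info("leader lease lost or released")
			},
			OnNewLeader: func(identity string) {
				log.WithField("leader", identity).Info("observed new leader")
			},
		},
	})
}
```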
staebler
left a comment
/lgtm
/test e2e-aws
1 similar comment
/lgtm
@dgoodwin: you cannot LGTM your own PR.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest Please review the full test history for this PR and help us cut down flakes.
10 similar comments
/retest Please review the full test history for this PR and help us cut down flakes.
3 similar comments
/hold
/test e2e-aws
/test e2e-aws
/retest
Failing for a week? This is really strange; there does not appear to be anything in here that could have any impact beyond the CCO, which we can see is running fine in the artifacts.
/test e2e-aws
/retest
@dgoodwin: The following test failed, say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Ok, @joelddiaz had the idea that perhaps this PR is cursed and suggested we reopen it; indeed, that passed e2e-aws on the first try. I am replacing this PR with #239. I have no idea why this PR is seeing the failures it is, but we can't see any reason why a change to our leader election would be causing mass API failures across the cluster.
@dgoodwin: This pull request references Bugzilla bug 1858403. The bug has been updated to no longer refer to the pull request using the external bug tracker.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.