Bug 1858403: Use client-go leader election to write less. #231

dgoodwin · 2020-07-31T19:05:56Z

We were using the defaults from controller runtime previously:

lease duration: 15s
renew deadline: 10s
retry period: 2s

This meant that the active leader was writing to etcd every 2 seconds to
update the lease, which is excessive writing and spawned the bug above.

We now implement leader election using the underlying client-go code to
get access to ReleaseOnCancel, which is not presently exposed in
controller-runtime.

This lets us release the lock immediately on normal shutdown,
eliminating the delay before another pod takes over, as well as the
startup delay during development.
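
For reference, a rough sketch of the arithmetic behind the timing values quoted above. It is not code from this PR. It assumes the active leader renews the lock once per retry period and that an unreleased lock stays held for about one lease duration after the last renewal. The `electionTiming` type, the helper names, and the `relaxedExample` values are hypothetical and only serve as a comparison point.

```go
package main

import (
	"fmt"
	"time"
)

// electionTiming groups the three timing settings named in the description.
// This type is illustrative only and is not part of client-go.
type electionTiming struct {
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// controllerRuntimeDefaults holds the previous values quoted in the description.
var controllerRuntimeDefaults = electionTiming{
	LeaseDuration: 15 * time.Second,
	RenewDeadline: 10 * time.Second,
	RetryPeriod:   2 * time.Second,
}

// relaxedExample is HYPOTHETICAL: these numbers are not taken from the PR and
// exist only to show how the write rate scales with the retry period.
var relaxedExample = electionTiming{
	LeaseDuration: 120 * time.Second,
	RenewDeadline: 90 * time.Second,
	RetryPeriod:   30 * time.Second,
}

// renewWritesPerHour estimates lock updates per hour, assuming the active
// leader writes once per retry period.
func renewWritesPerHour(t electionTiming) int {
	if t.RetryPeriod <= 0 {
		return 0
	}
	return int(time.Hour / t.RetryPeriod)
}

// staleLockWindow approximates how long a lock that was not released stays
// held after its last renewal: about one lease duration.
func staleLockWindow(t electionTiming) time.Duration {
	return t.LeaseDuration
}

func main() {
	for name, t := range map[string]electionTiming{
		"controller-runtime defaults": controllerRuntimeDefaults,
		"relaxed (hypothetical)":      relaxedExample,
	} {
		fmt.Printf("%s: ~%d writes/hour, stale lock held ~%s\n",
			name, renewWritesPerHour(t), staleLockWindow(t))
	}
}
```

Under these assumptions the old defaults work out to about 1800 lock writes per hour. Lengthening the periods cuts the writes but widens the window a crashed leader blocks others, which is why releasing the lock on clean shutdown matters.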

openshift-ci-robot · 2020-07-31T19:05:59Z

@dgoodwin: This pull request references Bugzilla bug 1858403, which is invalid:

expected the bug to target the "4.6.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Bug 1858403: Tune leader election to write less.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

staebler

/lgtm

smarterclayton · 2020-07-31T19:46:38Z

/hold

Please see my comments on the BZ; this is not the right way to solve this problem (you should release the lock on shutdown).

openshift-ci-robot · 2020-08-05T12:15:41Z

@dgoodwin: This pull request references Bugzilla bug 1858403, which is invalid:

expected the bug to target the "4.6.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

dgoodwin · 2020-08-05T12:17:40Z

PR updated, now uses client-go leader election directly so we can get access to ReleaseOnCancel which is not currently exposed in controller-runtime.

Leader election IDs are now generated as UUIDs. Previously the ID was a flat string, which seems like it should have been broken, but somehow UUIDs were getting appended in the actual configmap.

Tested bringing up multiple processes and switching between them. Provided we exit cleanly, there is only a couple of seconds of startup delay when a new process takes over, or when restarting a single process.
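
A minimal sketch of generating a unique per-process election identity, for illustration only. It builds a version 4 UUID from the standard library as a stand-in for whatever UUID package the PR actually uses. The function name `newElectionID` is hypothetical.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newElectionID returns a random version 4 UUID string to use as a unique
// leader election identity for this process. Hypothetical helper, not the
// code from this PR.
func newElectionID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version field to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newElectionID()
	if err != nil {
		fmt.Println("unable to generate leader election ID:", err)
		return
	}
	fmt.Println("generated leader election ID:", id)
}
```

A distinct identity per process is what lets each candidate tell its own lock record apart from another pod's.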

dgoodwin · 2020-08-05T12:25:44Z

/bugzilla refresh

openshift-ci-robot · 2020-08-05T12:25:50Z

@dgoodwin: This pull request references Bugzilla bug 1858403, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.6.0) matches configured target release for branch (4.6.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

staebler

A nice improvement!

staebler · 2020-08-05T13:15:10Z

pkg/cmd/operator/cmd.go

-			if err := controller.AddToManager(mgr, kubeconfigCommandLinePath); err != nil {
-				log.WithError(err).Fatal("unable to register controllers to the manager")
+			// Leader election code based on:
+			// https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/examples/leader-election/main.go#L130


Make this a permalink, and cover the entire section that you are basing the code on.
https://github.com/kubernetes/kubernetes/blob/f7e3bcdec2e090b7361a61e21c20b3dbbb41b7f0/staging/src/k8s.io/client-go/examples/leader-election/main.go#L92-L154

staebler · 2020-08-05T13:18:00Z

pkg/cmd/operator/cmd.go

+			leLog := log.WithField("id", id)
+			leLog.Info("generated leader election ID")
+
+			lock := &resourcelock.ConfigMapLock{


Is this a good time to switch to using a LeaseLock instead of a ConfigMapLock? Or is this more complicated due to needing to upgrade existing CCOs that would still be using a ConfigMapLock?

I wasn't fully aware of how LeaseLock works; it doesn't appear to be a resource in-cluster, at least in 4.5 where I was running this. There is definitely potential for problems during upgrade though, so I'd vote to stay consistent for now.

staebler

/lgtm

dgoodwin · 2020-08-10T11:59:43Z

/test e2e-aws

dgoodwin · 2020-08-10T16:14:58Z

/test e2e-aws

dgoodwin · 2020-08-11T11:38:54Z

/lgtm

openshift-ci-robot · 2020-08-11T11:38:55Z

@dgoodwin: you cannot LGTM your own PR.

openshift-bot · 2020-08-16T18:57:30Z

/retest