@dgoodwin
Contributor

@dgoodwin dgoodwin commented Jul 31, 2020

We were previously using the defaults from controller-runtime:

lease duration: 15s
renew deadline: 10s
retry period: 2s

This meant that the active leader was writing to etcd every 2 seconds to
update the lease, which is excessive writing and spawned the bug above.
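
For a rough sense of scale, the arithmetic can be sketched in Go. This is only an illustration: the `electionTiming` type and its method are hypothetical names, and it assumes a healthy leader issues exactly one lease update per retry period, as described above.

```go
package main

import (
	"fmt"
	"time"
)

// electionTiming holds the three leader election settings quoted above.
// The type name is hypothetical and exists only for this sketch.
type electionTiming struct {
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// controllerRuntimeDefaults mirrors the values listed in this description.
var controllerRuntimeDefaults = electionTiming{
	LeaseDuration: 15 * time.Second,
	RenewDeadline: 10 * time.Second,
	RetryPeriod:   2 * time.Second,
}

// renewWritesPerHour estimates how many lease updates a healthy leader
// issues per hour, assuming one write per retry period.
func (t electionTiming) renewWritesPerHour() int {
	return int(time.Hour / t.RetryPeriod)
}

func main() {
	fmt.Printf("retry period %s: about %d lease writes per hour\n",
		controllerRuntimeDefaults.RetryPeriod,
		controllerRuntimeDefaults.renewWritesPerHour())
}
```

Under that assumption, lengthening the retry period lowers the write rate proportionally.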

We now implement leader election using the underlying client-go code to
get access to ReleaseOnCancel, which is not presently exposed in
controller-runtime.

This allows us to release the lock immediately on normal shutdown,
eliminating the delay before another pod takes over, as well as the
startup delay during development.
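
The effect of releasing on shutdown can be modeled with a small in-memory sketch. It is not the client-go implementation: `lockRecord`, `canAcquire`, and `release` are hypothetical names, and the rule that a candidate may acquire when the holder is empty or the lease has expired is a simplifying assumption about how the lock behaves.

```go
package main

import (
	"fmt"
	"time"
)

// lockRecord is a simplified, hypothetical stand-in for the record
// stored in the lock object.
type lockRecord struct {
	Holder        string
	RenewTime     time.Time
	LeaseDuration time.Duration
}

// canAcquire reports whether a candidate may take the lock at time now:
// either nobody holds it, or the holder's lease has expired.
func canAcquire(r lockRecord, now time.Time) bool {
	return r.Holder == "" || now.After(r.RenewTime.Add(r.LeaseDuration))
}

// release models a clean shutdown that clears the holder.
func release(r lockRecord) lockRecord {
	r.Holder = ""
	return r
}

// takeoverDemo checks whether a candidate can acquire the lock one second
// after the old leader's last renewal, without and with a release.
func takeoverDemo() (withoutRelease, withRelease bool) {
	start := time.Date(2020, 8, 5, 0, 0, 0, 0, time.UTC)
	held := lockRecord{Holder: "old-leader", RenewTime: start, LeaseDuration: 15 * time.Second}
	soon := start.Add(1 * time.Second)
	return canAcquire(held, soon), canAcquire(release(held), soon)
}

func main() {
	without, with := takeoverDemo()
	fmt.Println("acquire 1s after exit, no release:", without)
	fmt.Println("acquire 1s after exit, released:", with)
}
```

In this model, with a 15s lease duration and no release, a candidate checking one second after the old leader exits is still locked out. Once the holder is cleared, it can acquire on its next attempt.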

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Jul 31, 2020
@openshift-ci-robot
Contributor

@dgoodwin: This pull request references Bugzilla bug 1858403, which is invalid:

  • expected the bug to target the "4.6.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Bug 1858403: Tune leader election to write less.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jul 31, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2020
@staebler staebler left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 31, 2020
@smarterclayton
Contributor

/hold

Please see my comments on the BZ, this is not the right way to solve this problem (you should release the lock on shutdown)

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 31, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 5, 2020
@dgoodwin dgoodwin changed the title Bug 1858403: Tune leader election to write less. Bug 1858403: Use client-go leader election to write less. Aug 5, 2020
@dgoodwin
Contributor Author

dgoodwin commented Aug 5, 2020

PR updated. It now uses client-go leader election directly so we can get access to ReleaseOnCancel, which is not currently exposed in controller-runtime.

We now generate UUIDs for leader election IDs. Previously we passed a flat string, which seems like it should have been broken, but somehow UUIDs were getting appended in the actual ConfigMap.

Tested bringing up multiple processes and switching between them. Provided we exit cleanly, there is only a couple of seconds of startup delay when a new process takes over, or when restarting a single process.
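
Generating a unique ID per process could look like the sketch below. It assumes nothing about which UUID library the PR uses; `newElectionID` is a hypothetical helper that builds a random version 4 UUID from the standard library's `crypto/rand`.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newElectionID returns a random RFC 4122 version 4 UUID string.
// Hypothetical helper: the PR itself may rely on a UUID package instead.
func newElectionID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version field to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newElectionID()
	if err != nil {
		panic(err)
	}
	fmt.Println("generated leader election ID:", id)
}
```

A fresh ID on every start means two processes never present the same identity to the lock, which a fixed string cannot guarantee.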

@dgoodwin
Contributor Author

dgoodwin commented Aug 5, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 5, 2020
@openshift-ci-robot
Contributor

@dgoodwin: This pull request references Bugzilla bug 1858403, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 5, 2020
@staebler staebler left a comment

A nice improvement!

if err := controller.AddToManager(mgr, kubeconfigCommandLinePath); err != nil {
log.WithError(err).Fatal("unable to register controllers to the manager")
// Leader election code based on:
// https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/examples/leader-election/main.go#L130
Contributor Author

Fixed.

leLog := log.WithField("id", id)
leLog.Info("generated leader election ID")

lock := &resourcelock.ConfigMapLock{
Contributor

Is this a good time to switch to using a LeaseLock instead of a ConfigMapLock? Or is this more complicated due to needing to upgrade existing CCOs that would still be using a ConfigMapLock?

Contributor Author

I wasn't fully aware of how LeaseLock works; it doesn't appear to be a resource in-cluster, at least in 4.5 where I was running this. There is definitely potential for problems during upgrade, though, so I'd vote to stay consistent for now.

@staebler staebler left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 5, 2020
@dgoodwin
Contributor Author

/test e2e-aws

@dgoodwin
Contributor Author

/lgtm

@openshift-ci-robot
Contributor

@dgoodwin: you cannot LGTM your own PR.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@sdodson
Member

sdodson commented Aug 17, 2020

/hold
@dgoodwin ptal, this seems like a persistent failure

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2020
@joelddiaz
Contributor

/test e2e-aws
Looked through the artifacts from the most recent failure, and didn't come across anything pointing to CCO. Built a custom image with this PR's changes and the CI images from https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.6.0-0.ci/release/4.6.0-0.ci-2020-08-18-103144. I saw no issues.

@joelddiaz
Contributor

/test e2e-aws

@gregsheremeta
Contributor

/retest

@dgoodwin
Contributor Author

Failing for a week? This is really strange. Nothing in here appears able to have an impact beyond the CCO, which we can see is running fine in the artifacts.

/test e2e-aws

@dgoodwin
Contributor Author

/retest

@openshift-ci-robot
Contributor

@dgoodwin: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws 9ec5207 link /test e2e-aws

Full PR test history. Your PR dashboard.

@dgoodwin
Contributor Author

OK, @joelddiaz had the idea that perhaps this PR is cursed and suggested we reopen it, and indeed the new PR passed e2e-aws on the first try. I am replacing this PR with #239. I have no idea why this one is seeing these failures, but we can't see any reason why a change to our leader election would cause mass API failures across the cluster.

@dgoodwin dgoodwin closed this Aug 26, 2020
@openshift-ci-robot
Contributor

@dgoodwin: This pull request references Bugzilla bug 1858403. The bug has been updated to no longer refer to the pull request using the external bug tracker.
