update leader election defaults so it handles 60s of kube-apiserver communication disruption by deads2k · Pull Request #1104 · openshift/library-go

deads2k · 2021-06-09T17:36:02Z

We want to handle 60s of communication disruption in all components.

pkg/config/leaderelection/leaderelection.go

smarterclayton · 2021-07-08T15:22:04Z

pkg/config/leaderelection/leaderelection.go

 func LeaderElectionDefaulting(config configv1.LeaderElection, defaultNamespace, defaultName string) configv1.LeaderElection {
 	ret := *(&config).DeepCopy()

+	// 1. lock skew tolerance is leaseDuration-renewDeadline == 22s


I might suggest we call this method FastLeaderElectionDefaulting or CriticalLeaderElectionDefaulting, since these defaults should only be used for the most critical services. Is there a set of library-go based operators that are not critical and can tolerate 30s / 3 retry leases?

One more question - why are you trying to have 6 retries (i.e. what reason for 6 vs 3)?

Now answered in the code comments.

smarterclayton · 2021-07-08T16:32:43Z

I appreciate the commit title.

LGTM

smarterclayton · 2021-07-08T19:14:51Z

/lgtm

openshift-ci · 2021-07-08T19:15:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, smarterclayton

Needs approval from an approver in each of these files:

~~OWNERS~~ [deads2k,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Update the package-server-manager controller and update the leader election intervals. This is in order to comply with the need for OCP components to be able to withstand 60s of API server disruption on SNO-enabled clusters.

For more information, see the following resources:
- openshift/library-go#1104 (comment)
- https://bugzilla.redhat.com/show_bug.cgi?id=1985697

Alternative implementations include disabling leader election entirely for SNO-enabled clusters. That alternative would center on dynamically querying for the Infrastructure/cluster singleton resource, checking the HA/non-HA expectations being exposed, and setting leader election accordingly. It would still need to be careful about how to handle transient errors and provide an escape hatch (e.g. prefer enablement of leader election through a CLI flag over the dynamic value) that users can pass to the PSM deployment for failed upgrades.

This bump is intended to address the issue that SNO clusters cannot handle 60s of API server communication disruption. The fix was added in openshift/library-go#1104.

To be able to handle an API server disruption on SNO, the leader election timeouts need to be adjusted according to github.com/openshift/library-go/pull/1104. Signed-off-by: Christoph Stäbler <cstabler@redhat.com>

openshift-ci bot requested review from smarterclayton and sttts June 9, 2021 17:36

openshift-ci bot added the "approved" label (indicates a PR has been approved by an approver from all required OWNERS files) Jun 9, 2021

smarterclayton reviewed Jun 10, 2021

pkg/config/leaderelection/leaderelection.go (outdated)

smarterclayton reviewed Jun 10, 2021

pkg/config/leaderelection/leaderelection.go (outdated)

deads2k force-pushed the span-60s branch from 15b82b4 to e3a81ab Compare June 10, 2021 16:44

smarterclayton reviewed Jun 10, 2021

pkg/config/leaderelection/leaderelection.go (outdated)

deads2k force-pushed the span-60s branch from e3a81ab to 72c43e1 Compare June 10, 2021 19:09

smarterclayton reviewed Jul 8, 2021

deads2k force-pushed the span-60s branch from 356a15d to 39d3c07 Compare July 8, 2021 16:25

update leader election defaults so it handles 60s of kube-apiserver c…

7e7d216

…ommunication disruption

deads2k force-pushed the span-60s branch from 39d3c07 to 7e7d216 Compare July 8, 2021 17:31

openshift-ci bot assigned smarterclayton Jul 8, 2021

openshift-ci bot added the "lgtm" label (indicates that a PR is ready to be merged) Jul 8, 2021

openshift-merge-robot merged commit 4b9033d into openshift:master Jul 8, 2021

aojea mentioned this pull request Jul 26, 2021

Bug 1984635: use new default leader election values to handle apiserver rollout on SNO openshift/cluster-config-operator#211

Merged

timflannagan mentioned this pull request Jul 27, 2021

Bug 1985697: Update the package-server-manager leader election configuration openshift/operator-framework-olm#136

Merged

creydr mentioned this pull request Jul 30, 2021

Bug 1984683: use new default leader election values to handle apiserver rollout on SNO openshift/sdn#328

Merged

bertinatto mentioned this pull request Aug 2, 2021

Bug 1986215: Bump library-go to get leader election fixes openshift/cluster-storage-operator#196

Merged

creydr mentioned this pull request Aug 4, 2021

Bug 1989246: use new default leader election values to handle apiserver rollout on SNO openshift/cluster-network-operator#1175

Merged

creydr mentioned this pull request Jan 10, 2022

Bug 2033751: Update leader election timeouts to handle api server disruptions on SNO openshift/cloud-network-config-controller#15

Merged

openshift-merge-robot merged 1 commit into openshift:master from deads2k:span-60s
