Skip to content

update leader election defaults so it handles 60s of kube-apiserver communication disruption#1104

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
deads2k:span-60s
Jul 8, 2021
Merged

update leader election defaults so it handles 60s of kube-apiserver communication disruption#1104
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
deads2k:span-60s

Conversation

@deads2k
Copy link
Contributor

@deads2k deads2k commented Jun 9, 2021

found via openshift/origin#26215

We want to handle 60s of communication disruption in all components.

@openshift-ci openshift-ci bot requested review from smarterclayton and sttts June 9, 2021 17:36
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2021
func LeaderElectionDefaulting(config configv1.LeaderElection, defaultNamespace, defaultName string) configv1.LeaderElection {
ret := *(&config).DeepCopy()

// 1. lock skew tolerance is leaseDuration-renewDeadline == 22s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might suggest we call this method FastLeaderElectionDefaulting or CriticalLeaderElectionDefaulting, since these defaults should only be used for the most critical services. Are there a set of library-go based operators that are not critical that can tolerate 30s / 3 retry leases?

One more question - why are you trying to have 6 retries (i.e. what reason for 6 vs 3)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might suggest we call this method FastLeaderElectionDefaulting or CriticalLeaderElectionDefaulting, since these defaults should only be used for the most critical services. Are there a set of library-go based operators that are not critical that can tolerate 30s / 3 retry leases?

One more question - why are you trying to have 6 retries (i.e. what reason for 6 vs 3)?

Now answered in the code comments.

@smarterclayton
Copy link
Contributor

I appreciate the commit title.

LGTM

@smarterclayton
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 8, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 8, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [deads2k,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 4b9033d into openshift:master Jul 8, 2021
timflannagan added a commit to timflannagan/operator-framework-olm that referenced this pull request Jul 27, 2021
Update the package-server-manager controller and update the leader
election intervals. This is in order to comply with the need for OCP
components to be able to withstand 60s of API server disruption on
SNO-enabled clusters.

For more information, see the following resources:
- openshift/library-go#1104 (comment)
- https://bugzilla.redhat.com/show_bug.cgi?id=1985697

Alternative implementations include disabling leader election entirely
for SNO-enabled clusters. This implementation is centered around
dynamically querying for the Infrastructure/cluster singleton resource,
checking the HA/non-HA expectations being exposed, and setting leader
election properly. This implementation would still need to be careful
about how to handle transient errors and provide an escape hatch (e.g.
prefer enablement of leader election through a CLI flag, vs. the dynamic
value) that users can pass to the PSM deployment for failed upgrades.
timflannagan added a commit to timflannagan/operator-framework-olm that referenced this pull request Jul 27, 2021
Update the package-server-manager controller and update the leader
election intervals. This is in order to comply with the need for OCP
components to be able to withstand 60s of API server disruption on
SNO-enabled clusters.

For more information, see the following resources:
- openshift/library-go#1104 (comment)
- https://bugzilla.redhat.com/show_bug.cgi?id=1985697

Alternative implementations include disabling leader election entirely
for SNO-enabled clusters. This implementation is centered around
dynamically querying for the Infrastructure/cluster singleton resource,
checking the HA/non-HA expectations being exposed, and setting leader
election properly. This implementation would still need to be careful
about how to handle transient errors and provide an escape hatch (e.g.
prefer enablement of leader election through a CLI flag, vs. the dynamic
value) that users can pass to the PSM deployment for failed upgrades.
bertinatto added a commit to bertinatto/cluster-storage-operator that referenced this pull request Aug 2, 2021
This bump is intended to address the issue that SNO cluster
cannot handle 60s of API server communication disruptions.
The fix was added in openshift/library-go#1104.
creydr added a commit to creydr/cloud-network-config-controller that referenced this pull request Jan 10, 2022
To be able an api server disruptions on SNO, the leader election timeouts
needs to be adjusted acording to
github.com/openshift/library-go/pull/1104.

Signed-off-by: Christoph Stäbler <cstabler@redhat.com>
creydr added a commit to creydr/cloud-network-config-controller that referenced this pull request Jan 10, 2022
To be able to handle an api server disruption on SNO, the leader
election timeouts needs to be adjusted according to
github.com/openshift/library-go/pull/1104.

Signed-off-by: Christoph Stäbler <cstabler@redhat.com>
creydr added a commit to creydr/cloud-network-config-controller that referenced this pull request Jan 10, 2022
To be able to handle an api server disruption on SNO, the leader
election timeouts needs to be adjusted according to
github.com/openshift/library-go/pull/1104.

Signed-off-by: Christoph Stäbler <cstabler@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments