update leader election defaults so it handles 60s of kube-apiserver communication disruption #1104
func LeaderElectionDefaulting(config configv1.LeaderElection, defaultNamespace, defaultName string) configv1.LeaderElection {
	ret := *(&config).DeepCopy()

	// 1. clock skew tolerance is leaseDuration-renewDeadline == 22s
I might suggest we call this method FastLeaderElectionDefaulting or CriticalLeaderElectionDefaulting, since these defaults should only be used for the most critical services. Is there a set of library-go based operators that are not critical and can tolerate 30s / 3 retry leases?
One more question - why are you trying to have 6 retries (i.e. what is the reason for 6 vs 3)?
Now answered in the code comments.
I appreciate the commit title. LGTM
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, smarterclayton
Update the package-server-manager controller and update the leader election intervals. This is in order to comply with the need for OCP components to be able to withstand 60s of API server disruption on SNO-enabled clusters.

For more information, see the following resources:
- openshift/library-go#1104 (comment)
- https://bugzilla.redhat.com/show_bug.cgi?id=1985697

Alternative implementations include disabling leader election entirely for SNO-enabled clusters. Such an implementation would center on dynamically querying the Infrastructure/cluster singleton resource, checking the HA/non-HA expectations it exposes, and setting leader election accordingly. It would still need to handle transient errors carefully and provide an escape hatch (e.g. prefer enablement of leader election through a CLI flag over the dynamic value) that users can pass to the PSM deployment for failed upgrades.
This bump is intended to address the issue that SNO clusters cannot handle 60s of API server communication disruption. The fix was added in openshift/library-go#1104.
To be able to handle an API server disruption on SNO, the leader election timeouts need to be adjusted according to github.com/openshift/library-go/pull/1104.
Signed-off-by: Christoph Stäbler <cstabler@redhat.com>
found via openshift/origin#26215
We want to handle 60s of communication disruption in all components.