@Danil-Grigorev Danil-Grigorev commented Aug 12, 2020

Reduce the default lease retry rate to once every 30s, which prevents heavy writes to etcd at idle, and constrain the renew deadline to 90s.
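
To make the effect on etcd concrete, here is a rough back-of-the-envelope sketch in Go. It assumes that an idle leader renews its lease once per retry period and that each renewal results in one write. The function names are illustrative and not taken from this PR, and the 2s baseline is what controller-runtime is believed to use by default, not a value quoted in this conversation.

```go
package main

import (
	"fmt"
	"time"
)

// renewalsPerHour estimates how many lease renewals a single idle leader
// issues per hour, assuming exactly one renewal (one write) per retry period.
func renewalsPerHour(retryPeriod time.Duration) float64 {
	return float64(time.Hour) / float64(retryPeriod)
}

// renewalsPerHourSeconds is a hypothetical convenience wrapper that takes
// the retry period in whole seconds.
func renewalsPerHourSeconds(retrySeconds int) float64 {
	return renewalsPerHour(time.Duration(retrySeconds) * time.Second)
}

func main() {
	// 2s is an assumed baseline; 30s, 60s and 90s are values discussed in this PR.
	for _, s := range []int{2, 30, 60, 90} {
		fmt.Printf("retry every %2ds -> about %.0f lease writes per hour\n",
			s, renewalsPerHourSeconds(s))
	}
}
```

Under these assumptions, a 30s retry period works out to about 120 lease writes per hour for each controller, against about 1800 at a 2s period.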

Inspired by openshift/cloud-credential-operator#231

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 12, 2020
@openshift-ci-robot
Contributor

@Danil-Grigorev: This pull request references Bugzilla bug 1858400, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

BUG 1858400: [Performance] Lease refresh period for machine-api-controllers is too high, causes heavy writes to etcd at idle

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

@Danil-Grigorev: This pull request references Bugzilla bug 1858400, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Contributor

@michaelgugino michaelgugino left a comment

@Danil-Grigorev Danil-Grigorev force-pushed the leader-election-retry-less branch from a405869 to 9ffe27a Compare August 12, 2020 17:00
@Danil-Grigorev
Author

Danil-Grigorev commented Aug 12, 2020

Set this to 60s, so it will retry once per renew deadline.

Danil-Grigorev added 4 commits August 12, 2020 19:16
  • Prevent machine controllers from writing in etcd at idle too often by setting 30s retry and 90s deadline on all renewals. BZ 1858403
  • Prevent machine controllers from writing in etcd at idle too often by setting 30s retry and 90s deadline on all renewals. BZ 1858403
  • Prevent machine controllers from writing in etcd at idle too often by setting 60s retry and delay on all renewals. BZ 1858403
  • Prevent machine controllers from writing in etcd at idle too often by setting 60s retry and delay on all renewals. BZ 1858403
@Danil-Grigorev Danil-Grigorev force-pushed the leader-election-retry-less branch from 9ffe27a to 48d9cce Compare August 12, 2020 17:17
@Danil-Grigorev
Author

On reflection, set it to 30s and 90s, so that it retries at least once or twice.

Contributor

@michaelgugino michaelgugino left a comment

Let's do

LeaseDuration = 180 seconds
RenewDeadline = 120 seconds
RetryPeriod = 90 seconds

RenewDeadline needs to be less than LeaseDuration according to the examples.

@elmiko
Contributor

elmiko commented Aug 12, 2020

in general i think this is a good patch, but i agree with @michaelgugino that the values should be higher.

@michaelgugino
Contributor

Okay, after some more discussion, we determined that 120/110/90 (LeaseDuration/RenewDeadline/RetryPeriod, in seconds) might be a better fit.

We don't want the lease to be too long: if the pod gets moved (e.g., during upgrades), operations shouldn't be suspended for too long. 120 seconds would be adequate.
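
The ordering constraint and the trade-off discussed above can be sketched in Go as follows. This is a minimal illustration, not code from this PR: the `Timings` type and its helpers are hypothetical, and the 1.2 jitter factor is an assumption based on how client-go's leader election is understood to validate its configuration. The takeover estimate is deliberately rough and ignores jitter and clock skew.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jitterFactor is assumed to match the multiplier client-go applies to the
// retry period when validating leader-election settings.
const jitterFactor = 1.2

// Timings is a hypothetical type grouping the three durations under discussion.
type Timings struct {
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// inSeconds is a hypothetical helper that builds Timings from whole seconds.
func inSeconds(lease, renew, retry int) Timings {
	return Timings{
		LeaseDuration: time.Duration(lease) * time.Second,
		RenewDeadline: time.Duration(renew) * time.Second,
		RetryPeriod:   time.Duration(retry) * time.Second,
	}
}

// Validate checks that lease > renew deadline > retry period * jitter factor.
func (t Timings) Validate() error {
	if t.LeaseDuration <= t.RenewDeadline {
		return errors.New("lease duration must be greater than renew deadline")
	}
	if t.RenewDeadline <= time.Duration(jitterFactor*float64(t.RetryPeriod)) {
		return errors.New("renew deadline must be greater than retry period times jitter factor")
	}
	return nil
}

// roughTakeoverDelay estimates how long a standby may wait before taking over
// after the leader disappears without releasing the lease: the lease has to
// expire, and the standby only checks once per retry period.
func (t Timings) roughTakeoverDelay() time.Duration {
	return t.LeaseDuration + t.RetryPeriod
}

func main() {
	candidates := []struct {
		name string
		t    Timings
	}{
		{"180/120/90 (first suggestion)", inSeconds(180, 120, 90)},
		{"120/110/90 (agreed values)", inSeconds(120, 110, 90)},
		{"120/100/90 (hypothetical)", inSeconds(120, 100, 90)},
	}
	for _, c := range candidates {
		if err := c.t.Validate(); err != nil {
			fmt.Printf("%s: invalid: %v\n", c.name, err)
			continue
		}
		fmt.Printf("%s: ok, rough takeover delay up to %v\n", c.name, c.t.roughTakeoverDelay())
	}
}
```

If the jitter assumption holds, a 90s retry period needs a renew deadline above 108s, which would explain why 110s fits where a hypothetical 100s would not. The shorter 120s lease also keeps the rough takeover delay lower than the 180s variant.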

Contributor

@elmiko elmiko left a comment

thanks for the changes Danil, this looks good to me but i just have a quick question.

  "--v=3",
  "--leader-elect=true",
- "--leader-elect-lease-duration=90s",
+ "--leader-elect-lease-duration=120s",
Contributor

this is the only confusing part to me. we set the lease duration in the controller-runtime config options, do we also need to set it on the command line?

Contributor

We use the default if not specified on the CLI. This is useful for development/debugging.
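
The pattern described here, a built-in default that a command-line flag can override, might look roughly like the sketch below. It uses only Go's standard `flag` package. The flag name matches the one in the diff, but `parseLeaseDuration` and the 120s default are assumptions made for illustration, not the controllers' actual code.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// defaultLeaseDuration is the value assumed to apply when the flag is absent.
const defaultLeaseDuration = 120 * time.Second

// parseLeaseDuration is a hypothetical helper: it parses args with its own
// FlagSet and returns the lease duration, falling back to the default when
// --leader-elect-lease-duration is not given.
func parseLeaseDuration(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("controller", flag.ContinueOnError)
	lease := fs.Duration(
		"leader-elect-lease-duration",
		defaultLeaseDuration,
		"how long non-leader candidates wait before trying to take over leadership",
	)
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *lease, nil
}

func main() {
	lease, err := parseLeaseDuration(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("lease duration:", lease)
}
```

Run without arguments, this sketch uses the default. Passing `--leader-elect-lease-duration=90s` overrides it, which is convenient when experimenting locally.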

Contributor

ahh, perfect. thanks for the explanation Mike!

Contributor

@michaelgugino michaelgugino left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 13, 2020
Contributor

@elmiko elmiko left a comment

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 13, 2020
@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 13, 2020

@Danil-Grigorev: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-aws-workers-rhel7  fe7cbdc  link     /test e2e-aws-workers-rhel7
ci/prow/e2e-gcp-operator       fe7cbdc  link     /test e2e-gcp-operator
ci/prow/e2e-gcp                fe7cbdc  link     /test e2e-gcp
ci/prow/e2e-azure              fe7cbdc  link     /test e2e-azure

@openshift-merge-robot openshift-merge-robot merged commit fe883a2 into openshift:master Aug 13, 2020
@openshift-ci-robot
Contributor

@Danil-Grigorev: Some pull requests linked via external trackers have merged: openshift/machine-api-operator#675, openshift/machine-api-operator#649, openshift/cluster-api-provider-ovirt#56, openshift/cluster-api-provider-openstack#109. The following pull requests linked via external trackers have not merged:
