OCPBUGS-42083: Don't roll out revision until three etcd endpoints are listed #1743
Conversation
```go
		return fmt.Errorf("%v empty in config", strings.Join(requiredPath, "."))
	}

	if requiredPath[0] == "etcd-servers" {
```
I'd be curious to see in the tests if this also gated rev 1, though I doubt it will be the case. Worst case scenario we can always add that check as a precondition to the revision controller.
Right, requiredPath[0] is apiServerArguments; updated the code and added unit tests.
My comment was about checking in the test run if the config for revision 1 got the required list of etcd-servers or if it bypassed this check. We need to make sure that rev 1 gets the minimum number of endpoints, otherwise we might run into the same problem as before with the first kube-apiserver pod.
Ah, got it. Well, we can't issue a revision if the config is not considered correct, so I think we have this covered.
Not sure we have revision controller unit tests yet.
```go
	if !ok {
		return fmt.Errorf("%v is not a slice", strings.Join(requiredPath, "."))
	}
	if len(configValSlice) < 3 {
```
This needs to be gated for SNO as I assume it would only get 2 endpoints. Also, shouldn't we aim for 4 endpoints since we include localhost?
yes, adding an SNO gate too.
> shouldn't we aim for 4 endpoints since we include localhost?

Three endpoints (localhost, the local IP, and the etcd IP on any other master) should be sufficient to prevent a 10-minute hangup.
We may never have all 4 endpoints on HA with the assisted installer: one of the masters is used as a bootstrap node, so we shouldn't rely on having all 3 masters available during bootstrap.
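The check being discussed can be sketched as follows. This is an illustrative sketch, not the exact PR code: the function and parameter names are hypothetical, but it shows the shape of the gate (at least three etcd-servers endpoints required, skipped for single-node topologies).

```go
// Hypothetical sketch of the endpoint-count gate discussed above.
// Names are illustrative; the real validation lives in the operator's
// config observation code.
package main

import (
	"fmt"
	"strings"
)

func validateEtcdServers(requiredPath []string, endpoints []string, singleReplica bool) error {
	// Only gate the apiServerArguments["etcd-servers"] path.
	if requiredPath[len(requiredPath)-1] != "etcd-servers" {
		return nil
	}
	// SNO clusters only ever list localhost plus the local IP,
	// so the three-endpoint requirement must be skipped there.
	if singleReplica {
		return nil
	}
	if len(endpoints) < 3 {
		return fmt.Errorf("%v has fewer than three endpoints: %v",
			strings.Join(requiredPath, "."), endpoints)
	}
	return nil
}

func main() {
	path := []string{"apiServerArguments", "etcd-servers"}
	// Two endpoints on HA: the config is rejected, so no new revision is cut.
	fmt.Println(validateEtcdServers(path,
		[]string{"https://localhost:2379", "https://10.0.0.4:2379"}, false))
	// Three endpoints: localhost, local IP, and another master's etcd.
	fmt.Println(validateEtcdServers(path,
		[]string{"https://localhost:2379", "https://10.0.0.4:2379", "https://10.0.0.5:2379"}, false))
}
```

With this shape, revision 1 is covered for free: the observed config never passes validation (and therefore never becomes a revision) until three endpoints are listed.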
I haven’t read the code, but I’m wondering if we only want to stop new revisions during the installation time. If not, what happens when a customer takes a master node out of the pool for service? Does that mean the kas-o will stop installing new revisions? If so, that’s bad, right?
Also, recently there was a change to the RevisionController (library-go) which permits passing a precondition function. New revisions will be allowed by the controller once the precondition is fulfilled. Maybe you could use it.
> If not, what happens when a customer takes a master node out of the pool for service?

Multiple scenarios here:

- The node may be physically down for maintenance, but it would still be listed as an endpoint.
- Also, while we require three endpoints, that's in fact two distinct etcd servers (the local IP and localhost are considered a single etcd server but two distinct endpoints). So this code won't prevent one master from being torn down during CPMS scale up / down.
- However, there is an etcd restore procedure where the cluster is reduced to a single node to restore from backup and later scaled back up to 3. We specifically don't want to roll out new revisions until all three masters are available, to ensure all three etcd servers are in sync.
TODO: run e2e with CPMS scale up / etcd disruption to ensure that it works as expected
> New revisions will be allowed by the controller once the precondition is fulfilled

WithRevisionControllerPrecondition might be an alternative to this solution, yes.
> WithRevisionControllerPrecondition might be an alternative to this solution, yes

Not a blocking comment, I just wanted to let you know the mechanism exists so that you may consider using it.
So, when an etcd endpoint is removed from the list?
Also, should we add a similar mechanism for the aggregated API servers ?
> just wanted to let you know the mechanism exist so that you may consider using it

Right, though I think the current approach is better suited for this situation: we already have a config validation function, so it's only natural to extend it. This way we prevent the config from being generated, not merely rolled out.
However, if this method won't work in some scenarios, WithRevisionControllerPrecondition would be a good candidate to try.
> should we add a similar mechanism for the aggregated API servers

It's not necessary imo; the issue happens early in bootstrap, when openshift-apiserver is not even created yet.
If kube-apiserver can reliably write to etcd, we'll notice a slow etcd rollout via the "static pod didn't roll out in 3 mins" test, and a similar solution can be applied to the openshift-apiservers.
```go
		return fmt.Errorf("%v empty in config", strings.Join(requiredPath, "."))
	}

	if requiredPath[0] == "etcd-servers" {
```
remember to only do this for HA clusters, not single node.
a6baafd to b82c96d
/test e2e-azure-ovn
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f3e3cdf0-7c9c-11ef-9614-4fb038d9905b-0
b82c96d to 7067c16
/test e2e-azure-ovn
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8898f0e0-7ca4-11ef-9d68-6af2d98e5388-0
@vrutkovs: This pull request references Jira Issue OCPBUGS-42083, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retest
Hold revision rollouts until we collect three etcd endpoints (localhost, the local IP, and any other master). This ensures that kube-apiserver has another etcd instance to connect to when the local instance is being reconfigured. This functionality is skipped when the control plane uses single-replica topology.
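The single-replica skip mentioned in the description would normally be derived from the cluster's Infrastructure resource (`status.controlPlaneTopology` in openshift/api). The sketch below uses local stand-in types so it stays self-contained; the constant names mirror, but are not, the real openshift/api ones.

```go
// Hedged illustration of the single-replica gate. In the operator this
// decision comes from the Infrastructure resource's controlPlaneTopology;
// the types here are local stand-ins for the openshift/api definitions.
package main

import "fmt"

type ControlPlaneTopology string

const (
	HighlyAvailableTopology ControlPlaneTopology = "HighlyAvailable"
	SingleReplicaTopology   ControlPlaneTopology = "SingleReplica"
)

// shouldGateRevisions reports whether the three-endpoint check applies.
func shouldGateRevisions(topology ControlPlaneTopology) bool {
	// SNO can never list three endpoints, so the gate must be skipped there.
	return topology != SingleReplicaTopology
}

func main() {
	fmt.Println(shouldGateRevisions(HighlyAvailableTopology)) // true
	fmt.Println(shouldGateRevisions(SingleReplicaTopology))   // false
}
```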
7067c16 to 992114a
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-kube-apiserver-rollout 10
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8c0e55c0-800b-11ef-8f16-62d28d7d5e0a-0
/test e2e-azure-ovn
/retest-required
/test e2e-aws-ovn
/test e2e-aws-ovn
/test e2e-gcp-operator
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dgrisonnet, vrutkovs. The full list of commands accepted by this bot can be found here. The pull request process is described here.
@vrutkovs: The following test failed, say
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
The problem this is targeting just failed another payload. I think the operator job is failing in general. I'm overriding and will notify the team in Slack.
/override ci/prow/e2e-gcp-operator
@deads2k: Overrode contexts on behalf of deads2k: ci/prow/e2e-gcp-operator
@vrutkovs: Jira Issue OCPBUGS-42083: Some pull requests linked via external trackers have merged. The following pull requests linked via external trackers have not merged; these pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with
Jira Issue OCPBUGS-42083 has not been moved to the MODIFIED state.
[ART PR BUILD NOTIFIER] Distgit: ose-cluster-kube-apiserver-operator
Prevent the apiserver from pointing to a single etcd endpoint (IP and localhost), which would lose its connection when that single etcd instance is being updated.