Skip to content

OCPBUGS-42083: Don't rollout revision until three etcd endpoints are listed#1743

Merged
openshift-merge-bot[bot] merged 1 commit intoopenshift:masterfrom
vrutkovs:minimum-two-etcd-endpoints
Oct 9, 2024
Merged

OCPBUGS-42083: Don't rollout revision until three etcd endpoints are listed#1743
openshift-merge-bot[bot] merged 1 commit intoopenshift:masterfrom
vrutkovs:minimum-two-etcd-endpoints

Conversation

@vrutkovs
Copy link
Copy Markdown
Contributor

Prevent apiserver from pointing to a single etcd endpoint (IP and localhost), which will lose connection when the single etcd is being updated

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 26, 2024
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2024
return fmt.Errorf("%v empty in config", strings.Join(requiredPath, "."))
}

if requiredPath[0] == "etcd-servers" {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be curious to see in the tests if this also gated rev 1, though I doubt it will be the case. Worst case scenario we can always add that check as a precondition to the revision controller.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, requiredPath[0] is apiServerArguments, updated the code and added unit tests

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was about checking in the test run if the config for revision 1 got the required list of etcd-servers or if it bypassed this check. We need to make sure that rev 1 gets the minimum number of endpoints, otherwise we might run into the same problem as before with the first kube-apiserver pod.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it. Well, we can't issue a revision if config is not considered correct, so I think we have this covered.

Not sure we already have revision controller unit tests

if !ok {
return fmt.Errorf("%v is not a slice", strings.Join(requiredPath, "."))
}
if len(configValSlice) < 3 {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be gated for SNO as I assume it would only get 2 endpoints. Also, shouldn't we aim for 4 endpoints since we include localhost?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, adding an SNO gate too.

shouldn't we aim for 4 endpoints since we include localhost?

Three endpoints - localhost, local IP and etcd IP on any other master - should be sufficient to prevent hangup for 10 mins.
We may never have all 4 endpoints on HA with assisted installer - one of masters is used as a bootstrap, so we shouldn't rely on having all 3 masters available during bootstrap.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven’t read the code, but I’m wondering if we only want to stop new revisions during the installation time. If not, what happens when a customer takes a master node out of the pool for service? Does that mean the kas-o will stop installing new revisions? If so, that’s bad, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, recently there was a change to the RevisionController (library-go) which permits passing a precondition function. New revisions will be allowed by the controller once the precondition fulfils. Maybe you could use it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not, what happens when a customer takes a master node out of the pool for service?

Multiple scenarios here:

  • The node may be physically down for maintenance, but it would still be listed as an endpoint.

  • Also, while we require three endpoints - its in fact two distinct etcd servers (local IP and localhost are considered single etcd server but two distinct endpoints). So this code won't prevent one master from being torn down during CPMS scale up / down

  • However there is an etcd restore procedure where the cluster is reduced to a single node to restore to backup and later being scale up to 3. We specifically don't want rolling out new revisions until all three masters are available to ensure all three etcd servers are in sync.

TODO: run e2e to do cpms scale up / etcd disruption to ensure that it works as expected

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New revisions will be allowed by the controller once the precondition fulfils

WithRevisionControllerPrecondition might be an alternative to this solution, yes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WithRevisionControllerPrecondition might be an alternative to this solution, yes

Not a blocking comment just wanted to let you know the mechanism exist so that you may consider using it.

So, when an etcd endpoint is removed from the list?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, should we add a similar mechanism for the aggregated API servers ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wanted to let you know the mechanism exist so that you may consider using it

Right, I think current one is more suited for this situation though - we already have a config validation function so its only natural to extend it. This way we prevent the config from being generated not merely rolled out.

However if this method won't work in some scenarios WithRevisionControllerPrecondition would be a good candidate to try.

should we add a similar mechanism for the aggregated API servers

Its not necessary imo, the issue is happening early on bootstrap when openshift-apiserver is not even created.

If kube-apiserver can reliably write to etcd we'll notice slow etcd rollout via "static pod didn't rollout in 3 mins" test - and a similar solution can be applied to openshift-apiservers

return fmt.Errorf("%v empty in config", strings.Join(requiredPath, "."))
}

if requiredPath[0] == "etcd-servers" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remember to only do this for HA clusters, not single node.

@vrutkovs vrutkovs force-pushed the minimum-two-etcd-endpoints branch 2 times, most recently from a6baafd to b82c96d Compare September 27, 2024 05:59
@vrutkovs
Copy link
Copy Markdown
Contributor Author

/test e2e-azure-ovn
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-kube-apiserver-rollout 10

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Sep 27, 2024

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-kube-apiserver-rollout

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f3e3cdf0-7c9c-11ef-9614-4fb038d9905b-0

@vrutkovs vrutkovs force-pushed the minimum-two-etcd-endpoints branch from b82c96d to 7067c16 Compare September 27, 2024 07:45
@vrutkovs
Copy link
Copy Markdown
Contributor Author

/test e2e-azure-ovn
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-kube-apiserver-rollout 10

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Sep 27, 2024

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-kube-apiserver-rollout

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8898f0e0-7ca4-11ef-9d68-6af2d98e5388-0

@vrutkovs vrutkovs changed the title WIP Don't rollout revision until three etcd endpoints are listed OCPBUGS-42083: Don't rollout revision until three etcd endpoints are listed Sep 27, 2024
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 27, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Sep 27, 2024
@openshift-ci-robot
Copy link
Copy Markdown

@vrutkovs: This pull request references Jira Issue OCPBUGS-42083, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Prevent apiserver from pointing to a single etcd endpoint (IP and localhost), which will lose connection when the single etcd is being updated

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Sep 27, 2024
@openshift-ci openshift-ci Bot requested a review from wangke19 September 27, 2024 14:38
@vrutkovs
Copy link
Copy Markdown
Contributor Author

vrutkovs commented Oct 1, 2024

/retest

Hold revision rollouts until we collect three etcd endpoints (localhost,
local IP and any other). This ensures that kube-apiserver has another
etcd to connect to when local instance is being reconfigured.

This functionality is skipped in control plane uses single replica
topology
@vrutkovs vrutkovs force-pushed the minimum-two-etcd-endpoints branch from 7067c16 to 992114a Compare October 1, 2024 15:39
@vrutkovs
Copy link
Copy Markdown
Contributor Author

vrutkovs commented Oct 1, 2024

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-kube-apiserver-rollout 10

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Oct 1, 2024

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-kube-apiserver-rollout

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8c0e55c0-800b-11ef-8f16-62d28d7d5e0a-0

@vrutkovs
Copy link
Copy Markdown
Contributor Author

vrutkovs commented Oct 1, 2024

/test e2e-azure-ovn

@vrutkovs
Copy link
Copy Markdown
Contributor Author

vrutkovs commented Oct 7, 2024

/retest-required

@vrutkovs
Copy link
Copy Markdown
Contributor Author

vrutkovs commented Oct 7, 2024

/test e2e-aws-ovn

@vrutkovs
Copy link
Copy Markdown
Contributor Author

vrutkovs commented Oct 7, 2024

/test e2e-aws-ovn

@vrutkovs
Copy link
Copy Markdown
Contributor Author

vrutkovs commented Oct 8, 2024

/test e2e-gcp-operator

@dgrisonnet
Copy link
Copy Markdown
Member

/lgtm
/approve

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Oct 9, 2024
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Oct 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [dgrisonnet,vrutkovs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Copy Markdown

/retest-required

Remaining retests: 0 against base HEAD 3a58fd7 and 2 for PR HEAD 992114a in total

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Oct 9, 2024

@vrutkovs: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-operator-single-node 7067c16 link false /test e2e-gcp-operator-single-node

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@deads2k
Copy link
Copy Markdown
Contributor

deads2k commented Oct 9, 2024

The problem this is targeting just failed another payload. I think the operator job is failing in general. I'm overriding and will notify the team in slack.

/override ci/prow/e2e-gcp-operator

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Oct 9, 2024

@deads2k: Overrode contexts on behalf of deads2k: ci/prow/e2e-gcp-operator

Details

In response to this:

The problem this is targeting just failed another payload. I think the operator job is failing in general. I'm overriding and will notify the team in slack.

/override ci/prow/e2e-gcp-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit a64efd0 into openshift:master Oct 9, 2024
@openshift-ci-robot
Copy link
Copy Markdown

@vrutkovs: Jira Issue OCPBUGS-42083: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-42083 has not been moved to the MODIFIED state.

Details

In response to this:

Prevent apiserver from pointing to a single etcd endpoint (IP and localhost), which will lose connection when the single etcd is being updated

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link
Copy Markdown
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-kube-apiserver-operator
This PR has been included in build ose-cluster-kube-apiserver-operator-container-v4.18.0-202410092141.p0.ga64efd0.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants