Skip to content

Conversation

@ironcladlou
Copy link
Contributor

@ironcladlou ironcladlou commented Sep 24, 2020

Before this patch, there were two implicit etcd cluster scaling strategies
applied in different contexts. This patch make those strategies explicit
and adds a new strategy to support additional use cases.

The strategies are:

  • HAScalingStrategy (default): the etcd cluster will only be scaled up when at least
    3 node are available so that HA is enforced at all times. This rule applies
    during bootstrapping and in the steady state.

  • NonHAScalingStrategy means that during bootstrapping, the etcd cluster will
    be allowed to scale when at least 2 members are available (which is not HA),
    but after bootstrapping any further scaling will require 3 nodes in the same
    way as HAScalingStrategy.

    This strategy is selected by adding the openshift.io/non-ha-bootstrap
    annotation to the openshift-etcd namespace.

  • UnsafeScalingStrategy means scaling will occur without regard to nodes and
    any effect on quorum. Use of this strategy isn't officially tested or supported,
    but is made available for ad-hoc use.

    This strategy is selected by setting unsupportedConfigOverrides on the
    operator config.

NonHAScalingStrategy is new and is intended to support use cases such as
assisted installer which don't use a dedicated bootstrap node and must
tolerate non-HA etcd during bootstrapping only. Currently the way to enable this
strategy is by looking for a marker file during manifest rendering. This is to
provide some measure of support without introducing new installer API.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2020
@ironcladlou
Copy link
Contributor Author

/test e2e-metal-assisted

@openshift-ci-robot
Copy link

@ironcladlou: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test e2e
  • /test e2e-aws
  • /test e2e-azure
  • /test e2e-disruptive
  • /test e2e-gcp
  • /test e2e-metal-ipi
  • /test e2e-operator
  • /test e2e-upgrade
  • /test images
  • /test unit
  • /test verify
  • /test verify-deps

Use /test all to run the following jobs:

  • pull-ci-openshift-cluster-etcd-operator-master-e2e
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-disruptive
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-operator
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-upgrade
  • pull-ci-openshift-cluster-etcd-operator-master-images
  • pull-ci-openshift-cluster-etcd-operator-master-unit
  • pull-ci-openshift-cluster-etcd-operator-master-verify
  • pull-ci-openshift-cluster-etcd-operator-master-verify-deps
Details

In response to this:

/test e2e-metal-assisted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou
Copy link
Contributor Author

/test e2e-metal-assisted

@openshift-ci-robot
Copy link

@ironcladlou: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test e2e
  • /test e2e-aws
  • /test e2e-azure
  • /test e2e-disruptive
  • /test e2e-gcp
  • /test e2e-metal-ipi
  • /test e2e-operator
  • /test e2e-upgrade
  • /test images
  • /test unit
  • /test verify
  • /test verify-deps

Use /test all to run the following jobs:

  • pull-ci-openshift-cluster-etcd-operator-master-e2e
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-disruptive
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-operator
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-upgrade
  • pull-ci-openshift-cluster-etcd-operator-master-images
  • pull-ci-openshift-cluster-etcd-operator-master-unit
  • pull-ci-openshift-cluster-etcd-operator-master-verify
  • pull-ci-openshift-cluster-etcd-operator-master-verify-deps
Details

In response to this:

/test e2e-metal-assisted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou
Copy link
Contributor Author

/test e2e-metal-assisted

Copy link
Contributor

@hexfusion hexfusion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good added a note around the topic of accountability.

@ironcladlou ironcladlou changed the title WIP: relax HA assertions when bootstrapping within Assisted Installer… WIP: tolerate loss of HA during assisted installer bootstrap Sep 28, 2020
@romfreiman
Copy link

@eranco74

@ironcladlou
Copy link
Contributor Author

/test e2e-metal-assisted

2 similar comments
@romfreiman
Copy link

/test e2e-metal-assisted

@romfreiman
Copy link

/test e2e-metal-assisted

@openshift-ci-robot
Copy link

@ironcladlou: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-disruptive f37a8d4 link /test e2e-disruptive
ci/prow/e2e-metal-assisted f37a8d4 link /test e2e-metal-assisted
ci/prow/e2e 2d86f3c link /test e2e

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

case isManagedByAssistedInstaller:
// When managed by assisted installer, tolerate unsafe conditions only up
// until bootstrap is complete, and then enforce as in the supported case.
if nodeCount < 3 && bootstrapComplete {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the assisted installer flow there is a time gap between bootstrap complete and the time the 3rd master joins.
(the bootstrap node pivots to be the 3rd master once bootstrap is completed)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bootstrap being bootkube service running on the bootstrap node

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this decision has implications. technically speaking the install-config would represent 2 master replicas. In which case install complete should require 2 master nodes. Then scale up to 3 would be a secondary process to the install. Understanding a point in time where we are installComplete is important.

After install status is complete the operator will go into a degraded state until the cluster has 3 or more master nodes.

So this allows a cluster to achieve install complete of less than 3 master nodes but not tolerate less than three after that point. So ideally you would resolve the scaling before install-complete but it would not be required. In the case where no 3rd master ever joined the cluster, it would remain degraded with clear message.

[1] openshift/enhancements#480

@hexfusion
Copy link
Contributor

/test e2e-metal-assisted

@romfreiman
Copy link

@carbonin can u please check why it failed. The logs should be there. @YuviGold can you point out where are they?

@YuviGold
Copy link

@carbonin can u please check why it failed. The logs should be there. @YuviGold can you point out where are they?

artifacts -> e2e-metal-assisted -> baremetalds-assisted-gather -> cluster logs [installation-started-at_cluster-id]

@romfreiman
Copy link

seems that masters are good:
msg="Found 2 master nodes:
map[test-infra-cluster-assisted-installer-master-0:[{Type:MemoryPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:18 +0000 UTC Reason:KubeletHasSufficientMemory Message:kubelet has sufficient memory available} {Type:DiskPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:18 +0000 UTC Reason:KubeletHasNoDiskPressure Message:kubelet has no disk pressure} {Type:PIDPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:18 +0000 UTC Reason:KubeletHasSufficientPID Message:kubelet has sufficient PID available}
{Type:Ready Status:True LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:35:29 +0000 UTC Reason:KubeletReady Message:kubelet is posting ready status}]

test-infra-cluster-assisted-installer-master-2:[{Type:MemoryPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:19 +0000 UTC Reason:KubeletHasSufficientMemory Message:kubelet has sufficient memory available} {Type:DiskPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:19 +0000 UTC Reason:KubeletHasNoDiskPressure Message:kubelet has no disk pressure} {Type:PIDPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:19 +0000 UTC Reason:KubeletHasSufficientPID Message:kubelet has sufficient PID available}
{Type:Ready Status:True LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:35:29 +0000 UTC Reason:KubeletReady Message:kubelet is posting ready status}]]"

@romfreiman
Copy link

but bootkube timed out

@hexfusion
Copy link
Contributor

/test e2e-metal-assisted

@hexfusion hexfusion mentioned this pull request Oct 29, 2020
@ironcladlou ironcladlou force-pushed the adaptive-bootstrap-ha branch from b52cf85 to 408284a Compare November 13, 2020 18:57
@ironcladlou ironcladlou changed the title WIP: tolerate loss of HA during assisted installer bootstrap Enable etcd HA to be temporarily disabled during bootstrap Nov 13, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 13, 2020
@ironcladlou ironcladlou force-pushed the adaptive-bootstrap-ha branch 2 times, most recently from 46612a7 to 84876ba Compare November 13, 2020 19:03
@ironcladlou
Copy link
Contributor Author

/retest

@hexfusion
Copy link
Contributor

e2e-metal-assisted failure is due to pending fixes in ci.

ref: openshift/cluster-baremetal-operator#75

@hexfusion
Copy link
Contributor

fix has merged openshift/cluster-baremetal-operator#81

/test e2e-metal-assisted

@hexfusion
Copy link
Contributor

hexfusion commented Dec 1, 2020

based on a basic review of ci/prow/e2e-metal-assisted[1] which passed etcd appears stable at term 5.

/lgtm

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/449/pull-ci-openshift-cluster-etcd-operator-master-e2e-agnostic-upgrade/1331236135539576832

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 1, 2020
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, ironcladlou

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [hexfusion,ironcladlou]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hexfusion
Copy link
Contributor

/hold cancel
/retest

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 1, 2020
@ironcladlou
Copy link
Contributor Author

/retest

@hexfusion
Copy link
Contributor

/test e2e-agnostic

@hexfusion
Copy link
Contributor

infra
/test e2e-agnostic

@hexfusion
Copy link
Contributor

infra....
/test e2e-agnostic

@hexfusion
Copy link
Contributor

/test e2e-agnostic

1 similar comment
@hexfusion
Copy link
Contributor

/test e2e-agnostic

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@hexfusion
Copy link
Contributor

strange error case here is this just prom down?

Dec  5 18:22:38.215: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/etcd/leader_changes.go:24]: Unexpected error:
    <*v1.Error | 0xc001a78900>: {
        Type: "server_error",
        Msg: "server error: 503",
        Detail: "<html>\r\n  <head>\r\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\r\n\r\n  <style type=\"text/css\">\r\n  /*!\r\n   * Bootstrap v3.3.5 (http://getbootstrap.com)\r\n   * Copyright 2011-2015 Twitter, Inc.\r\n   * Licensed under MIT (https://github.com/twbs/bootstrap/blob/master/LICENSE)\r\n   */\r\n  /*! normalize.css v3.0.3 | MIT License | github.com/necolas/normalize.css */\r\n  html {\r\n    font-family: sans-serif;\r\n    -ms-text-size-adjust: 100%;\r\n    -webkit-text-size-adjust: 100%;\r\n  }\r\n  body {\r\n    margin: 0;\r\n  }\r\n  h1 {\r\n    font-size: 1.7em;\r\n    font-weight: 400;\r\n    line-height: 1.3;\r\n    margin: 0.68em 0;\r\n  }\r\n  * {\r\n    -webkit-box-sizing: border-box;\r\n    -moz-box-sizing: border-box;\r\n    box-sizing: border-box;\r\n  }\r\n  *:before,\r\n  *:after {\r\n    -webkit-box-sizing: border-box;\r\n    -moz-box-sizing: border-box;\r\n    box-sizing: border-box;\r\n  }\r\n  html {\r\n    -webkit-tap-highlight-color: rgba(0, 0, 0, 0);\r\n  }\r\n  body {\r\n    font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;\r\n    line-height: 1.66666667;\r\n    font-size: 13px;\r\n    color: #333333;\r\n    background-color: #ffffff;\r\n    margin: 2em 1em;\r\n  }\r\n  p {\r\n    margin: 0 0 10px;\r\n    font-size: 13px;\r\n  }\r\n  .alert.alert-info {\r\n    padding: 15px;\r\n    margin-bottom: 20px;\r\n    border: 1px solid transparent;\r\n    background-color: #f5f5f5;\r\n    border-color: #8b8d8f;\r\n    color: #363636;\r\n    margin-top: 30px;\r\n  }\r\n  .alert p {\r\n    padding-left: 35px;\r\n  }\r\n  a {\r\n    color: #0088ce;\r\n  }\r\n\r\n  ul {\r\n    position: relative;\r\n    padding-left: 51px;\r\n  }\r\n  p.info {\r\n    position: relative;\r\n    font-size: 15px;\r\n    margin-bottom: 10px;\r\n  }\r\n  p.info:before, p.info:after {\r\n    content: \"\";\r\n    position: absolute;\r\n    top: 9%;\r\n    left: 0;\r\n  }\r\n  p.info:before {\r\n    content: \"i\";\r\n    left: 3px;\r\n    width: 20px;\r\n    height: 20px;\r\n    font-family: serif;\r\n    font-size: 15px;\r\n    font-weight: bold;\r\n    line-height: 21px;\r\n    text-align: center;\r\n    color: #fff;\r\n    background: #4d5258;\r\n    border-radius: 16px;\r\n  }\r\n\r\n  @media (min-width: 768px) {\r\n    body {\r\n      margin: 4em 3em;\r\n    }\r\n    h1 {\r\n      font-size: 2.15em;}\r\n  }\r\n\r\n  </style>\r\n  </head>\r\n  <body>\r\n    <div>\r\n      <h1>Application is not available</h1>\r\n      <p>The application is currently not serving requests at this endpoint. It may not have been started or is still starting.</p>\r\n\r\n      <div class=\"alert alert-info\">\r\n        <p class=\"info\">\r\n          Possible reasons you are seeing this page:\r\n        </p>\r\n        <ul>\r\n          <li>\r\n            <strong>The host doesn't exist.</strong>\r\n            Make sure the hostname was typed correctly and that a route matching this hostname exists.\r\n          </li>\r\n          <li>\r\n            <strong>The host exists, but doesn't have a matching path.</strong>\r\n            Check if the URL path was typed correctly and that the route was created using the desired path.\r\n          </li>\r\n          <li>\r\n            <strong>Route and path matches, but all pods are down.</strong>\r\n            Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.\r\n          </li>\r\n        </ul>\r\n      </div>\r\n    </div>\r\n  </body>\r\n</html>\r\n",
    }
    server_er

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot
Copy link
Contributor

openshift-merge-robot commented Dec 6, 2020

@ironcladlou: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-disruptive 2c2d3f5 link /test e2e-disruptive

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 9ed77ed into openshift:master Dec 6, 2020
carbonin added a commit to carbonin/assisted-installer-1 that referenced this pull request Dec 14, 2020
Going forward we will use a marker file written to the bootstrap
node. This is implemented in the service in  openshift/assisted-service#672
and the etcd operator in openshift/cluster-etcd-operator#449

Fixes MGMT-2454
carbonin added a commit to carbonin/assisted-installer-1 that referenced this pull request Dec 14, 2020
Going forward we will use a marker file written to the bootstrap
node. This is implemented in the service in  openshift/assisted-service#672
and the etcd operator in openshift/cluster-etcd-operator#449

Fixes MGMT-2454
carbonin added a commit to carbonin/assisted-installer-1 that referenced this pull request Dec 15, 2020
Going forward we will use a marker file written to the bootstrap
node. This is implemented in the service in  openshift/assisted-service#672
and the etcd operator in openshift/cluster-etcd-operator#449

Fixes MGMT-2454
carbonin added a commit to carbonin/assisted-installer-1 that referenced this pull request Dec 16, 2020
Going forward we will use a marker file written to the bootstrap
node. This is implemented in the service in  openshift/assisted-service#672
and the etcd operator in openshift/cluster-etcd-operator#449

Fixes MGMT-2454
openshift-merge-robot pushed a commit to openshift/assisted-installer that referenced this pull request Dec 16, 2020
Going forward we will use a marker file written to the bootstrap
node. This is implemented in the service in  openshift/assisted-service#672
and the etcd operator in openshift/cluster-etcd-operator#449

Fixes MGMT-2454
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.