Introduce bootstrap scaling strategies #449

ironcladlou · 2020-09-24T13:02:52Z

Before this patch, there were two implicit etcd cluster scaling strategies
applied in different contexts. This patch makes those strategies explicit
and adds a new strategy to support additional use cases.

The strategies are:

HAScalingStrategy (default): the etcd cluster will only be scaled up when at least
3 nodes are available, so that HA is enforced at all times. This rule applies
during bootstrapping and in the steady state.

NonHAScalingStrategy: during bootstrapping, the etcd cluster will
be allowed to scale when at least 2 members are available (which is not HA),
but after bootstrapping any further scaling will require 3 nodes, in the same
way as HAScalingStrategy.

This strategy is selected by adding the openshift.io/non-ha-bootstrap
annotation to the openshift-etcd namespace.

UnsafeScalingStrategy: scaling will occur without regard to nodes or
any effect on quorum. Use of this strategy isn't officially tested or supported,
but it is made available for ad-hoc use.

This strategy is selected by setting unsupportedConfigOverrides on the
operator config.
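
The rules above can be sketched in Go as follows. This is only an illustration of the description, not the code in this patch: the three strategy names come from the text above, while `requiredNodes`, `CheckScaling`, and `ErrNotEnoughNodes` are hypothetical names introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// ScalingStrategy values use the names from the PR description. Every other
// identifier in this file is hypothetical and exists only for illustration.
type ScalingStrategy string

const (
	HAScalingStrategy     ScalingStrategy = "HAScalingStrategy"
	NonHAScalingStrategy  ScalingStrategy = "NonHAScalingStrategy"
	UnsafeScalingStrategy ScalingStrategy = "UnsafeScalingStrategy"
)

// ErrNotEnoughNodes is a hypothetical sentinel error for a refused scale-up.
var ErrNotEnoughNodes = errors.New("not enough nodes to scale etcd")

// requiredNodes returns the minimum node count the strategy demands before
// scaling is allowed, following the rules stated in the description.
func requiredNodes(s ScalingStrategy, bootstrapComplete bool) int {
	switch s {
	case UnsafeScalingStrategy:
		// Scaling proceeds without regard to nodes or quorum.
		return 0
	case NonHAScalingStrategy:
		// Two are enough while bootstrapping; three afterwards.
		if !bootstrapComplete {
			return 2
		}
		return 3
	default:
		// HAScalingStrategy: three at all times.
		return 3
	}
}

// CheckScaling returns nil when scaling is allowed under the given strategy.
func CheckScaling(s ScalingStrategy, nodeCount int, bootstrapComplete bool) error {
	if need := requiredNodes(s, bootstrapComplete); nodeCount < need {
		return fmt.Errorf("%w: strategy %s needs %d, have %d",
			ErrNotEnoughNodes, s, need, nodeCount)
	}
	return nil
}

func main() {
	// Same node count, different outcome before and after bootstrap.
	fmt.Println(CheckScaling(NonHAScalingStrategy, 2, false))
	fmt.Println(CheckScaling(NonHAScalingStrategy, 2, true))
}
```

Keeping the threshold in one function makes the only difference between the HA and non-HA strategies visible at a glance: the bootstrap phase.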

NonHAScalingStrategy is new and is intended to support use cases such as
assisted installer, which doesn't use a dedicated bootstrap node and must
tolerate non-HA etcd during bootstrapping only. Currently this strategy is
enabled by looking for a marker file during manifest rendering. This
provides some measure of support without introducing new installer API.

ironcladlou · 2020-09-24T17:39:36Z

/test e2e-metal-assisted

openshift-ci-robot · 2020-09-24T17:39:51Z

@ironcladlou: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

/test e2e
/test e2e-aws
/test e2e-azure
/test e2e-disruptive
/test e2e-gcp
/test e2e-metal-ipi
/test e2e-operator
/test e2e-upgrade
/test images
/test unit
/test verify
/test verify-deps

Use /test all to run the following jobs:

pull-ci-openshift-cluster-etcd-operator-master-e2e
pull-ci-openshift-cluster-etcd-operator-master-e2e-disruptive
pull-ci-openshift-cluster-etcd-operator-master-e2e-operator
pull-ci-openshift-cluster-etcd-operator-master-e2e-upgrade
pull-ci-openshift-cluster-etcd-operator-master-images
pull-ci-openshift-cluster-etcd-operator-master-unit
pull-ci-openshift-cluster-etcd-operator-master-verify
pull-ci-openshift-cluster-etcd-operator-master-verify-deps

Details

In response to this:

/test e2e-metal-assisted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ironcladlou · 2020-09-24T19:51:30Z

/test e2e-metal-assisted

openshift-ci-robot · 2020-09-24T19:51:45Z

@ironcladlou: The specified target(s) for /test were not found.

ironcladlou · 2020-09-24T19:55:49Z

/test e2e-metal-assisted

hexfusion

Looking good; added a note around the topic of accountability.

pkg/etcdenvvar/envvarcontroller.go

romfreiman · 2020-09-28T18:21:26Z

@eranco74

ironcladlou · 2020-10-05T18:18:17Z

/test e2e-metal-assisted

romfreiman · 2020-10-08T10:42:53Z

/test e2e-metal-assisted

romfreiman · 2020-10-08T18:26:03Z

/test e2e-metal-assisted

openshift-ci-robot · 2020-10-14T14:57:59Z

@ironcladlou: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-disruptive	`f37a8d4`	link	`/test e2e-disruptive`
ci/prow/e2e-metal-assisted	`f37a8d4`	link	`/test e2e-metal-assisted`
ci/prow/e2e	`2d86f3c`	link	`/test e2e`

eranco74 · 2020-10-15T09:03:49Z

pkg/etcdenvvar/envvarcontroller.go

+	case isManagedByAssistedInstaller:
+		// When managed by assisted installer, tolerate unsafe conditions only up
+		// until bootstrap is complete, and then enforce as in the supported case.
+		if nodeCount < 3 && bootstrapComplete {


In the assisted installer flow, there is a time gap between bootstrap completing and the 3rd master joining
(the bootstrap node pivots to become the 3rd master once bootstrap is completed).

Here "bootstrap" means the bootkube service running on the bootstrap node.

I agree that this decision has implications. Technically speaking, the install-config would represent 2 master replicas, in which case install complete should require 2 master nodes. Scaling up to 3 would then be a process secondary to the install. Understanding the point in time at which we are installComplete is important.

After install status is complete, the operator will go into a degraded state until the cluster has 3 or more master nodes.

So this allows a cluster to reach install complete with fewer than 3 master nodes, but does not tolerate fewer than three after that point. Ideally you would resolve the scaling before install-complete, but it would not be required. In the case where no 3rd master ever joined the cluster, it would remain degraded with a clear message.

[1] openshift/enhancements#480
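
To make the timeline discussed above concrete, here is a minimal Go sketch of the check in the quoted diff, run against the three phases of the assisted installer flow. The field names mirror the variables in the diff; the `clusterState` type, the `degradedReason` function, the messages, and the behavior of the default branch (enforcing 3 nodes at all times) are assumptions made for the example, not the operator's actual code.

```go
package main

import "fmt"

// clusterState is a hypothetical snapshot of what the controller observes.
type clusterState struct {
	isManagedByAssistedInstaller bool
	bootstrapComplete            bool
	nodeCount                    int
}

// degradedReason returns a non-empty message when the operator would report
// a degraded state, and an empty string otherwise.
func degradedReason(s clusterState) string {
	switch {
	case s.isManagedByAssistedInstaller:
		// Tolerate fewer than 3 nodes only until bootstrap is complete.
		if s.nodeCount < 3 && s.bootstrapComplete {
			return fmt.Sprintf("need at least 3 master nodes after bootstrap, found %d", s.nodeCount)
		}
		return ""
	default:
		// Assumed supported case: 3 nodes are required at all times.
		if s.nodeCount < 3 {
			return fmt.Sprintf("need at least 3 master nodes, found %d", s.nodeCount)
		}
		return ""
	}
}

func main() {
	// Phases of the flow: bootstrapping with 2 masters, bootstrap complete
	// while the bootstrap node pivots, then the 3rd master has joined.
	timeline := []clusterState{
		{isManagedByAssistedInstaller: true, bootstrapComplete: false, nodeCount: 2},
		{isManagedByAssistedInstaller: true, bootstrapComplete: true, nodeCount: 2},
		{isManagedByAssistedInstaller: true, bootstrapComplete: true, nodeCount: 3},
	}
	for _, s := range timeline {
		fmt.Printf("bootstrapComplete=%t nodes=%d degraded=%q\n",
			s.bootstrapComplete, s.nodeCount, degradedReason(s))
	}
}
```

Only the middle phase yields a degraded message, which corresponds to the time gap raised in the review comment.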

hexfusion · 2020-10-22T12:09:45Z

/test e2e-metal-assisted

romfreiman · 2020-10-22T13:42:09Z

@carbonin can you please check why it failed? The logs should be there. @YuviGold can you point out where they are?

romfreiman · 2020-10-22T13:42:51Z

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/449/pull-ci-openshift-cluster-etcd-operator-master-e2e-metal-assisted/1319249678662897664/artifacts/e2e-metal-assisted/baremetalds-assisted-gather/2020-10-22_12:29:43_cdefedab-663f-40df-be58-ec467ece6ab6/

YuviGold · 2020-10-22T13:53:15Z

@carbonin can you please check why it failed? The logs should be there. @YuviGold can you point out where they are?

artifacts -> e2e-metal-assisted -> baremetalds-assisted-gather -> cluster logs [installation-started-at_cluster-id]

romfreiman · 2020-10-22T14:01:10Z

seems that masters are good:
msg="Found 2 master nodes:
map[test-infra-cluster-assisted-installer-master-0:[{Type:MemoryPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:18 +0000 UTC Reason:KubeletHasSufficientMemory Message:kubelet has sufficient memory available} {Type:DiskPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:18 +0000 UTC Reason:KubeletHasNoDiskPressure Message:kubelet has no disk pressure} {Type:PIDPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:18 +0000 UTC Reason:KubeletHasSufficientPID Message:kubelet has sufficient PID available}
{Type:Ready Status:True LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:35:29 +0000 UTC Reason:KubeletReady Message:kubelet is posting ready status}]

test-infra-cluster-assisted-installer-master-2:[{Type:MemoryPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:19 +0000 UTC Reason:KubeletHasSufficientMemory Message:kubelet has sufficient memory available} {Type:DiskPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:19 +0000 UTC Reason:KubeletHasNoDiskPressure Message:kubelet has no disk pressure} {Type:PIDPressure Status:False LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:34:19 +0000 UTC Reason:KubeletHasSufficientPID Message:kubelet has sufficient PID available}
{Type:Ready Status:True LastHeartbeatTime:2020-10-22 12:35:29 +0000 UTC LastTransitionTime:2020-10-22 12:35:29 +0000 UTC Reason:KubeletReady Message:kubelet is posting ready status}]]"

romfreiman · 2020-10-22T14:02:22Z

but bootkube timed out

hexfusion · 2020-10-22T14:22:49Z

/test e2e-metal-assisted

ironcladlou · 2020-11-30T16:00:02Z

/retest

hexfusion · 2020-11-30T17:00:00Z

The e2e-metal-assisted failure is due to pending fixes in CI.

ref: openshift/cluster-baremetal-operator#75

hexfusion · 2020-12-01T11:56:44Z

The fix has merged: openshift/cluster-baremetal-operator#81

/test e2e-metal-assisted

hexfusion · 2020-12-01T13:23:46Z

Based on a basic review of ci/prow/e2e-metal-assisted [1], which passed, etcd appears stable at term 5.

/lgtm

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/449/pull-ci-openshift-cluster-etcd-operator-master-e2e-agnostic-upgrade/1331236135539576832

openshift-ci-robot · 2020-12-01T13:24:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, ironcladlou

Needs approval from an approver in each of these files:

~~OWNERS~~ [hexfusion,ironcladlou]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hexfusion · 2020-12-01T13:24:24Z

/hold cancel
/retest

ironcladlou · 2020-12-03T15:30:10Z

/retest

hexfusion · 2020-12-04T16:40:11Z

/test e2e-agnostic

hexfusion · 2020-12-04T17:29:06Z

infra
/test e2e-agnostic

hexfusion · 2020-12-04T17:35:20Z

infra....
/test e2e-agnostic

hexfusion · 2020-12-04T18:57:58Z

/test e2e-agnostic

hexfusion · 2020-12-05T12:31:22Z

/test e2e-agnostic

openshift-bot · 2020-12-05T17:22:50Z

/retest