Conversation

@patrickdillon
Contributor

Following up from the Cluster Lifecycle call in late February, this is a PoC implementation of having the installer check cluster operators for stability. See openshift/enhancements#1189 for discussion.

Output on a successful run looks like this.

DEBUG Cluster is initialized                       
INFO Waiting up to 5m0s (until 5:48PM) to ensure each cluster operator has finished progressing... 
I0714 17:43:50.733286      13 request.go:665] Waited for 1.021801023s due to client-side throttling, not priority and fairness, request: GET:https://api.pad.illon07141318.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators?limit=500&resourceVersion=0
INFO All cluster operators have completed progressing 
INFO Waiting up to 10m0s (until 5:53PM) for the openshift-console route to be created... 

Note: I would like to get rid of the client-side throttling output, but I'm not sure how to do that.

Input is extremely welcome as this is not my normal coding domain.
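For reviewers who want the gist without reading the diff, the check is roughly the following. This is a minimal sketch assuming a simple polling loop; the package, function, and helper names, the client wiring, and the duration constants are illustrative rather than the PR's exact code (the PR itself ends up using a watch, per the discussion below).

package operatorstability

import (
	"context"
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// findProgressing returns the Progressing condition of a cluster operator, if present.
func findProgressing(conds []configv1.ClusterOperatorStatusCondition) *configv1.ClusterOperatorStatusCondition {
	for i := range conds {
		if conds[i].Type == configv1.OperatorProgressing {
			return &conds[i]
		}
	}
	return nil
}

// waitForOperatorsToSettle polls until every cluster operator has reported
// Progressing=False for at least 30 seconds, or the 5-minute budget runs out.
func waitForOperatorsToSettle(ctx context.Context, client configclient.Interface) error {
	const stableFor = 30 * time.Second
	return wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		cos, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, nil // treat list errors as transient and keep polling
		}
		for _, co := range cos.Items {
			cond := findProgressing(co.Status.Conditions)
			if cond == nil || cond.Status != configv1.ConditionFalse ||
				time.Since(cond.LastTransitionTime.Time) < stableFor {
				fmt.Printf("Cluster Operator %s has not finished progressing\n", co.Name)
				return false, nil
			}
		}
		return true, nil
	})
}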

@patrickdillon
Contributor Author

/cc @deads2k

@patrickdillon patrickdillon force-pushed the settle-operators branch 2 times, most recently from 50021d7 to b40bd1b Compare July 14, 2022 19:33
@patrickdillon
Contributor Author

/hold
For discussion

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 14, 2022
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2022
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 1, 2022
@patrickdillon
Contributor Author

/retest

@patrickdillon
Contributor Author

patrickdillon commented Oct 1, 2022

@patrickdillon
Contributor Author

Added a debug statement (without testing 👼) in eadfe4e

@patrickdillon
Contributor Author

time="2022-10-02T21:27:35Z" level=info msg="Waiting up to 5m0s (until 9:32PM) to ensure each cluster operator has finished progressing..."
time="2022-10-02T21:27:35Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2022-10-02 21:27:10 +0000 UTC DurationSinceTransition=25.460499834s"
time="2022-10-02T21:32:35Z" level=error msg="Cluster operator kube-apiserver does not meet stability threshold of Progressing=false for greater than 30 seconds with Reason: \"AsExpected\" and Message: \"NodeInstallerProgressing: 3 nodes are at revision 9\""

from: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/6124/pull-ci-openshift-installer-master-e2e-azure-ovn/1576644439625961472

It seems like this operator was only checked once... I will add some more debug statements to check this.

@patrickdillon
Contributor Author

/test all

@patrickdillon
Contributor Author

time="2022-10-03T12:18:56Z" level=info msg="Waiting up to 5m0s (until 12:23PM) to ensure each cluster operator has finished progressing..."
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator baremetal is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator authentication is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2022-10-03 12:18:29 +0000 UTC DurationSinceTransition=27.959505618s"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator storage is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator kube-scheduler is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator cloud-controller-manager is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator kube-controller-manager is Progressing=True LastTransitionTime=2022-10-03 12:11:01 +0000 UTC DurationSinceTransition=475.962649037s"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator insights is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator cloud-credential is stable"
time="2022-10-03T12:18:57Z" level=debug msg="Cluster Operator kube-storage-version-migrator is stable"
time="2022-10-03T12:18:57Z" level=debug msg="Cluster Operator cluster-autoscaler is stable"
time="2022-10-03T12:18:57Z" level=debug msg="Cluster Operator machine-api is stable"
time="2022-10-03T12:18:57Z" level=debug msg="Cluster Operator service-ca is stable"
time="2022-10-03T12:18:57Z" level=debug msg="Cluster Operator config-operator is stable"
time="2022-10-03T12:18:58Z" level=debug msg="Cluster Operator machine-approver is stable"
time="2022-10-03T12:18:58Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=152.270874612s"
time="2022-10-03T12:18:58Z" level=debug msg="Cluster Operator openshift-samples is stable"
time="2022-10-03T12:18:58Z" level=debug msg="Cluster Operator marketplace is stable"
time="2022-10-03T12:18:58Z" level=debug msg="Cluster Operator console is stable"
time="2022-10-03T12:18:59Z" level=debug msg="Cluster Operator image-registry is stable"
time="2022-10-03T12:18:59Z" level=debug msg="Cluster Operator node-tuning is stable"
time="2022-10-03T12:18:59Z" level=debug msg="Cluster Operator ingress is stable"
time="2022-10-03T12:18:59Z" level=debug msg="Cluster Operator operator-lifecycle-manager-packageserver is stable"
time="2022-10-03T12:18:59Z" level=debug msg="Cluster Operator control-plane-machine-set is stable"
time="2022-10-03T12:19:00Z" level=debug msg="Cluster Operator openshift-controller-manager is stable"
time="2022-10-03T12:19:00Z" level=debug msg="Cluster Operator operator-lifecycle-manager is stable"
time="2022-10-03T12:19:00Z" level=debug msg="Cluster Operator monitoring is stable"
time="2022-10-03T12:19:00Z" level=debug msg="Cluster Operator machine-config is stable"
time="2022-10-03T12:19:00Z" level=debug msg="Cluster Operator operator-lifecycle-manager-catalog is stable"
time="2022-10-03T12:19:01Z" level=debug msg="Cluster Operator dns is stable"
time="2022-10-03T12:19:01Z" level=debug msg="Cluster Operator openshift-apiserver is stable"
time="2022-10-03T12:19:01Z" level=debug msg="Cluster Operator csi-snapshot-controller is stable"
time="2022-10-03T12:19:01Z" level=debug msg="Cluster Operator network is stable"
time="2022-10-03T12:19:17Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=171.005530042s"
time="2022-10-03T12:19:30Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=184.629099327s"
time="2022-10-03T12:19:30Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=184.687712695s"
time="2022-10-03T12:20:05Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=219.355492607s"
time="2022-10-03T12:20:08Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=222.773698137s"
time="2022-10-03T12:20:15Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=229.642400053s"
time="2022-10-03T12:20:15Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=229.662644635s"
time="2022-10-03T12:20:39Z" level=debug msg="Cluster Operator kube-controller-manager is Progressing=False LastTransitionTime=2022-10-03 12:20:39 +0000 UTC DurationSinceTransition=0.656392816s"
time="2022-10-03T12:21:14Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=288.268508085s"
time="2022-10-03T12:21:30Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=304.623004056s"
time="2022-10-03T12:21:30Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=304.657572518s"
time="2022-10-03T12:22:03Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=337.1953858s"
time="2022-10-03T12:22:13Z" level=debug msg="Cluster Operator etcd is Progressing=False LastTransitionTime=2022-10-03 12:22:13 +0000 UTC DurationSinceTransition=0.899676568s"
time="2022-10-03T12:22:15Z" level=debug msg="Cluster Operator etcd is Progressing=False LastTransitionTime=2022-10-03 12:22:13 +0000 UTC DurationSinceTransition=2.6364960809999998s"
time="2022-10-03T12:22:15Z" level=debug msg="Cluster Operator etcd is Progressing=False LastTransitionTime=2022-10-03 12:22:13 +0000 UTC DurationSinceTransition=2.667407134s"
time="2022-10-03T12:23:56Z" level=error msg="Cluster operator kube-apiserver does not meet stability threshold of Progressing=false for greater than 30 seconds with Reason: \"AsExpected\" and Message: \"NodeInstallerProgressing: 3 nodes are at revision 8\""
time="2022-10-03T12:23:56Z" level=error msg="Cluster operator kube-controller-manager does not meet stability threshold of Progressing=false for greater than 30 seconds with Reason: \"AsExpected\" and Message: \"NodeInstallerProgressing: 3 nodes are at revision 6\""
time="2022-10-03T12:23:56Z" level=error msg="Cluster operator etcd does not meet stability threshold of Progressing=false for greater than 30 seconds with Reason: \"AsExpected\" and Message: \"NodeInstallerProgressing: 3 nodes are at revision 8\\nEtcdMembersProgressing: No unstarted etcd members found\""

Ok: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/6124/pull-ci-openshift-installer-master-e2e-aws-ovn/1576888802176143360 has a bit better debugging visibility.

It looks like etcd was correctly caught by the stability check (although I wonder whether it just needed a little more time). But I'm a little confused about whether a bug in my code is causing the api server and kube-controller-manager, in this case, to not be checked: I only see one check for each of those, while etcd is checked frequently.

@deads2k
Contributor

deads2k commented Oct 4, 2022

I wrote patrickdillon#1 to demonstrate the replaying watch that this logic needs in order to stay clean and use a watch without writing an entire controller.
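For readers following the thread, a replaying watch in that style can be built with client-go's watch tooling. The sketch below is only an illustration of the pattern, not the code in patrickdillon#1; the function names, the isStable predicate, and the pending-set bookkeeping are assumptions.

package operatorstability

import (
	"context"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// isStable is a simplified stand-in for the stability predicate: Progressing
// must be False and must have been False for at least 30 seconds.
func isStable(co *configv1.ClusterOperator) bool {
	for _, c := range co.Status.Conditions {
		if c.Type == configv1.OperatorProgressing {
			return c.Status == configv1.ConditionFalse &&
				time.Since(c.LastTransitionTime.Time) >= 30*time.Second
		}
	}
	return false
}

// waitForStableOperatorsWatch uses UntilWithSync, which first replays every
// existing ClusterOperator as an event and then streams updates, so the logic
// gets watch semantics without a standalone controller. Caveat: a watch only
// fires on changes, so a real implementation also needs to re-check operators
// once their 30-second threshold elapses (related to the behavior debugged
// earlier in this thread).
func waitForStableOperatorsWatch(ctx context.Context, client configclient.Interface) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	// Seed the set of operators that still need to become stable.
	cos, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	pending := map[string]bool{}
	for _, co := range cos.Items {
		pending[co.Name] = true
	}

	lw := cache.NewListWatchFromClient(
		client.ConfigV1().RESTClient(), "clusteroperators", "", fields.Everything())
	_, err = watchtools.UntilWithSync(ctx, lw, &configv1.ClusterOperator{}, nil,
		func(event watch.Event) (bool, error) {
			co, ok := event.Object.(*configv1.ClusterOperator)
			if !ok {
				return false, nil
			}
			if isStable(co) {
				delete(pending, co.Name)
			} else {
				pending[co.Name] = true
			}
			return len(pending) == 0, nil
		})
	return err
}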

@patrickdillon
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 5, 2022
@patrickdillon
Contributor Author

patrickdillon commented Oct 6, 2022

The latest ovirt test shows that the most recent commit fixes the bug: kube-apiserver is not stable at the beginning of the check, but by the end of the 5 minutes it is stable:

time="2022-10-05T20:42:50Z" level=debug msg="Cluster is initialized"
time="2022-10-05T20:42:50Z" level=info msg="Waiting up to 5m0s (until 8:47PM) to ensure each cluster operator has finished progressing..."
time="2022-10-05T20:42:50Z" level=debug msg="Cluster Operator kube-scheduler is stable"
[...]
time="2022-10-05T20:42:54Z" level=debug msg="Cluster Operator dns is stable"
time="2022-10-05T20:42:54Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2022-10-05 20:42:26 +0000 UTC DurationSinceTransition=28s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 7"
time="2022-10-05T20:42:54Z" level=debug msg="Cluster Operator marketplace is stable"
time="2022-10-05T20:42:54Z" level=debug msg="Cluster Operator console is stable"
time="2022-10-05T20:42:54Z" level=debug msg="Cluster Operator image-registry is stable"
time="2022-10-05T20:47:50Z" level=debug msg="Cluster operator kube-apiserver is now stable: Progressing=False LastTransitionTime=2022-10-05 20:42:26 +0000 UTC DurationSinceTransition=324s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 7"

So the most recent code essentially imposes a 5-minute wait if the operators are not stable during the initial check. I had to look through four or five tests to find this case, where the initial stability check failed; in all the other cases the operators were stable, so there is no 5-minute penalty.

In this case, kube-apiserver missed by two seconds. If we see this sort of case frequently but are not catching real errors, we could shorten the 5-minute duration so the penalty is not as severe.

@patrickdillon
Contributor Author

/retest

Contributor

kube style is normally:

if statusErr == nil {
	return nil
}
if !errors.Is(statusErr, wait.ErrWaitTimeout) {
	// not a timeout, so surface the underlying error as-is
	return statusErr
}
if meetsStabilityThreshold {
	return nil
}
return statusErr

Contributor Author

Ok. I have adopted this suggestion.

I'm not sure if I'm trying to be too clever, but the use of a closure here requires the code to be a little different: I need to always return err, which may be nil. I have added a comment to try to clarify this. The rationale for using this technique was that it simplifies the calling code (which no longer needs to track whether an operator is unstable). But I almost forgot that was the point when refactoring coStabilityChecker, so I wonder whether this is a good idea...

In any case, should be ok now!
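For anyone else reading along, the closure idea described above looks roughly like this. This is a hedged sketch only: coStabilityChecker in the PR may be shaped differently, newStabilityChecker is a made-up name, and isStable is the same hypothetical predicate used in the earlier sketch.

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// newStabilityChecker returns a closure that remembers which operators are
// still unstable, so the caller only ever calls it and looks at the returned
// error, which is deliberately returned on every call and may be nil.
func newStabilityChecker() func(co *configv1.ClusterOperator) error {
	unstable := map[string]error{}
	return func(co *configv1.ClusterOperator) error {
		if isStable(co) {
			delete(unstable, co.Name)
		} else {
			unstable[co.Name] = fmt.Errorf(
				"cluster operator %s does not meet stability threshold", co.Name)
		}
		// nil means every operator seen so far is stable; otherwise surface one
		// of the outstanding errors. Either way the caller keeps no state.
		for _, err := range unstable {
			return err
		}
		return nil
	}
}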

@deads2k
Contributor

deads2k commented Oct 6, 2022

I like this change. I have minor style comments where it doesn't match kube; we inherited the installer, so I'm not sure of its standards.

Adds a check to see whether each cluster operator has stopped
progressing for at least 30 seconds. There is a five minute period
where operators can meet this threshold.

This check protects against a class of errors where operators report
themselves as Available=true but continue to progress and are not
fully functional.
@sdodson
Member

sdodson commented Oct 7, 2022

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2022
@patrickdillon
Contributor Author

/retest

@jhixson74
Member

Are we just waiting for tests to pass? Or is more discussion needed? This LGTM to me.

@patrickdillon
Contributor Author

/refresh

@patrickdillon
Contributor Author

Are we just waiting for tests to pass? Or is more discussion needed? This LGTM to me.

@jhixson74 I talked to @sdodson & @deads2k, and it looks like this is ready to go in. If you want to review or LGTM, that would be great. Thank you!

@deads2k
Contributor

deads2k commented Oct 17, 2022

Are we just waiting for tests to pass? Or is more discussion needed? This LGTM to me.

/lgtm

looks good to me too.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2022
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 045b952 and 2 for PR HEAD 088a068 in total

@jhixson74
Member

/lgtm
/approve

@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhixson74, sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2022

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-vsphere 4d652d9d30289f14d0e9f64a428c71f56da01273 link false /test okd-scos-e2e-vsphere
ci/prow/okd-scos-e2e-aws-ovn 088a068 link false /test okd-scos-e2e-aws-ovn
ci/prow/okd-scos-e2e-aws-upgrade 088a068 link false /test okd-scos-e2e-aws-upgrade
ci/prow/e2e-ibmcloud-ovn 088a068 link false /test e2e-ibmcloud-ovn
ci/prow/e2e-libvirt 088a068 link false /test e2e-libvirt
ci/prow/okd-e2e-aws-upgrade 088a068 link false /test okd-e2e-aws-upgrade
ci/prow/okd-scos-e2e-gcp 4d652d9d30289f14d0e9f64a428c71f56da01273 link false /test okd-scos-e2e-gcp
ci/prow/okd-scos-e2e-gcp-ovn-upgrade 4d652d9d30289f14d0e9f64a428c71f56da01273 link false /test okd-scos-e2e-gcp-ovn-upgrade
ci/prow/okd-e2e-aws-ovn 088a068 link false /test okd-e2e-aws-ovn

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sdodson
Member

sdodson commented Oct 18, 2022

/retest-required
repo errors are generally transient in nature
