Conversation

@patrickdillon
Contributor

Following up from the Cluster Lifecycle call in late February, this is a PoC implementation of having the installer check cluster operators for stability. See openshift/enhancements#1189 for discussion.

Output on a successful run looks like this.

DEBUG Cluster is initialized                       
INFO Waiting up to 5m0s (until 5:48PM) to ensure each cluster operator has finished progressing... 
I0714 17:43:50.733286      13 request.go:665] Waited for 1.021801023s due to client-side throttling, not priority and fairness, request: GET:https://api.pad.illon07141318.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators?limit=500&resourceVersion=0
INFO All cluster operators have completed progressing 
INFO Waiting up to 10m0s (until 5:53PM) for the openshift-console route to be created... 

Note: I would like to get rid of the client-side throttling output, but I'm not sure how to do that.

Input is extremely welcome as this is not my normal coding domain.
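For reviewers who want the gist without reading the diff, the check is roughly the following. This is a minimal sketch assuming a simple polling loop; the package, function, and helper names, the client wiring, and the duration constants are illustrative rather than the PR's exact code (the PR itself ends up using a watch, per the discussion below).

package operatorstability

import (
	"context"
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// findProgressing returns the Progressing condition of a cluster operator, if present.
func findProgressing(conds []configv1.ClusterOperatorStatusCondition) *configv1.ClusterOperatorStatusCondition {
	for i := range conds {
		if conds[i].Type == configv1.OperatorProgressing {
			return &conds[i]
		}
	}
	return nil
}

// waitForOperatorsToSettle polls until every cluster operator has reported
// Progressing=False for at least 30 seconds, or the 5-minute budget runs out.
func waitForOperatorsToSettle(ctx context.Context, client configclient.Interface) error {
	const stableFor = 30 * time.Second
	return wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		cos, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, nil // treat list errors as transient and keep polling
		}
		for _, co := range cos.Items {
			cond := findProgressing(co.Status.Conditions)
			if cond == nil || cond.Status != configv1.ConditionFalse ||
				time.Since(cond.LastTransitionTime.Time) < stableFor {
				fmt.Printf("Cluster Operator %s has not finished progressing\n", co.Name)
				return false, nil
			}
		}
		return true, nil
	})
}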

@patrickdillon
Contributor Author

/cc @deads2k

@patrickdillon patrickdillon force-pushed the settle-operators branch 2 times, most recently from 50021d7 to b40bd1b Compare July 14, 2022 19:33
@patrickdillon
Contributor Author

/hold
For discussion

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 14, 2022
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2022
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 1, 2022
@patrickdillon
Contributor Author

/retest

@patrickdillon
Contributor Author

patrickdillon commented Oct 1, 2022

@patrickdillon
Contributor Author

Added a debug statement (without testing 👼) in eadfe4e

@patrickdillon
Contributor Author

time="2022-10-02T21:27:35Z" level=info msg="Waiting up to 5m0s (until 9:32PM) to ensure each cluster operator has finished progressing..."
time="2022-10-02T21:27:35Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2022-10-02 21:27:10 +0000 UTC DurationSinceTransition=25.460499834s"
time="2022-10-02T21:32:35Z" level=error msg="Cluster operator kube-apiserver does not meet stability threshold of Progressing=false for greater than 30 seconds with Reason: \"AsExpected\" and Message: \"NodeInstallerProgressing: 3 nodes are at revision 9\""

from: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/6124/pull-ci-openshift-installer-master-e2e-azure-ovn/1576644439625961472

It seems like this operator was only checked once... I will add some more debug statements to check this.

@patrickdillon
Contributor Author

/test all

@patrickdillon
Contributor Author

time="2022-10-03T12:18:56Z" level=info msg="Waiting up to 5m0s (until 12:23PM) to ensure each cluster operator has finished progressing..."
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator baremetal is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator authentication is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2022-10-03 12:18:29 +0000 UTC DurationSinceTransition=27.959505618s"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator storage is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator kube-scheduler is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator cloud-controller-manager is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator kube-controller-manager is Progressing=True LastTransitionTime=2022-10-03 12:11:01 +0000 UTC DurationSinceTransition=475.962649037s"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator insights is stable"
time="2022-10-03T12:18:56Z" level=debug msg="Cluster Operator cloud-credential is stable"
time="2022-10-03T12:18:57Z" level=debug msg="Cluster Operator kube-storage-version-migrator is stable"
time="2022-10-03T12:18:57Z" level=debug msg="Cluster Operator cluster-autoscaler is stable"
time="2022-10-03T12:18:57Z" level=debug msg="Cluster Operator machine-api is stable"
time="2022-10-03T12:18:57Z" level=debug msg="Cluster Operator service-ca is stable"
time="2022-10-03T12:18:57Z" level=debug msg="Cluster Operator config-operator is stable"
time="2022-10-03T12:18:58Z" level=debug msg="Cluster Operator machine-approver is stable"
time="2022-10-03T12:18:58Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=152.270874612s"
time="2022-10-03T12:18:58Z" level=debug msg="Cluster Operator openshift-samples is stable"
time="2022-10-03T12:18:58Z" level=debug msg="Cluster Operator marketplace is stable"
time="2022-10-03T12:18:58Z" level=debug msg="Cluster Operator console is stable"
time="2022-10-03T12:18:59Z" level=debug msg="Cluster Operator image-registry is stable"
time="2022-10-03T12:18:59Z" level=debug msg="Cluster Operator node-tuning is stable"
time="2022-10-03T12:18:59Z" level=debug msg="Cluster Operator ingress is stable"
time="2022-10-03T12:18:59Z" level=debug msg="Cluster Operator operator-lifecycle-manager-packageserver is stable"
time="2022-10-03T12:18:59Z" level=debug msg="Cluster Operator control-plane-machine-set is stable"
time="2022-10-03T12:19:00Z" level=debug msg="Cluster Operator openshift-controller-manager is stable"
time="2022-10-03T12:19:00Z" level=debug msg="Cluster Operator operator-lifecycle-manager is stable"
time="2022-10-03T12:19:00Z" level=debug msg="Cluster Operator monitoring is stable"
time="2022-10-03T12:19:00Z" level=debug msg="Cluster Operator machine-config is stable"
time="2022-10-03T12:19:00Z" level=debug msg="Cluster Operator operator-lifecycle-manager-catalog is stable"
time="2022-10-03T12:19:01Z" level=debug msg="Cluster Operator dns is stable"
time="2022-10-03T12:19:01Z" level=debug msg="Cluster Operator openshift-apiserver is stable"
time="2022-10-03T12:19:01Z" level=debug msg="Cluster Operator csi-snapshot-controller is stable"
time="2022-10-03T12:19:01Z" level=debug msg="Cluster Operator network is stable"
time="2022-10-03T12:19:17Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=171.005530042s"
time="2022-10-03T12:19:30Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=184.629099327s"
time="2022-10-03T12:19:30Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=184.687712695s"
time="2022-10-03T12:20:05Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=219.355492607s"
time="2022-10-03T12:20:08Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=222.773698137s"
time="2022-10-03T12:20:15Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=229.642400053s"
time="2022-10-03T12:20:15Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=229.662644635s"
time="2022-10-03T12:20:39Z" level=debug msg="Cluster Operator kube-controller-manager is Progressing=False LastTransitionTime=2022-10-03 12:20:39 +0000 UTC DurationSinceTransition=0.656392816s"
time="2022-10-03T12:21:14Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=288.268508085s"
time="2022-10-03T12:21:30Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=304.623004056s"
time="2022-10-03T12:21:30Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=304.657572518s"
time="2022-10-03T12:22:03Z" level=debug msg="Cluster Operator etcd is Progressing=True LastTransitionTime=2022-10-03 12:16:26 +0000 UTC DurationSinceTransition=337.1953858s"
time="2022-10-03T12:22:13Z" level=debug msg="Cluster Operator etcd is Progressing=False LastTransitionTime=2022-10-03 12:22:13 +0000 UTC DurationSinceTransition=0.899676568s"
time="2022-10-03T12:22:15Z" level=debug msg="Cluster Operator etcd is Progressing=False LastTransitionTime=2022-10-03 12:22:13 +0000 UTC DurationSinceTransition=2.6364960809999998s"
time="2022-10-03T12:22:15Z" level=debug msg="Cluster Operator etcd is Progressing=False LastTransitionTime=2022-10-03 12:22:13 +0000 UTC DurationSinceTransition=2.667407134s"
time="2022-10-03T12:23:56Z" level=error msg="Cluster operator kube-apiserver does not meet stability threshold of Progressing=false for greater than 30 seconds with Reason: \"AsExpected\" and Message: \"NodeInstallerProgressing: 3 nodes are at revision 8\""
time="2022-10-03T12:23:56Z" level=error msg="Cluster operator kube-controller-manager does not meet stability threshold of Progressing=false for greater than 30 seconds with Reason: \"AsExpected\" and Message: \"NodeInstallerProgressing: 3 nodes are at revision 6\""
time="2022-10-03T12:23:56Z" level=error msg="Cluster operator etcd does not meet stability threshold of Progressing=false for greater than 30 seconds with Reason: \"AsExpected\" and Message: \"NodeInstallerProgressing: 3 nodes are at revision 8\\nEtcdMembersProgressing: No unstarted etcd members found\""

Ok: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/6124/pull-ci-openshift-installer-master-e2e-aws-ovn/1576888802176143360 has a bit better debugging visibility.

It looks like etcd was correctly caught by the stability check (although I wonder whether it just needed a little more time). But I'm a little confused about whether a bug in my code is causing the api server and kube-controller-manager, in this case, to not be checked: I only see one check for each of those, while etcd is checked frequently.

@deads2k
Contributor

deads2k commented Oct 4, 2022

I wrote patrickdillon#1 to demonstrate the replaying watch that this logic needs in order to stay clean and use a watch without writing an entire controller.
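For readers following the thread, a replaying watch in that style can be built with client-go's watch tooling. The sketch below is only an illustration of the pattern, not the code in patrickdillon#1; the function names, the isStable predicate, and the pending-set bookkeeping are assumptions.

package operatorstability

import (
	"context"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// isStable is a simplified stand-in for the stability predicate: Progressing
// must be False and must have been False for at least 30 seconds.
func isStable(co *configv1.ClusterOperator) bool {
	for _, c := range co.Status.Conditions {
		if c.Type == configv1.OperatorProgressing {
			return c.Status == configv1.ConditionFalse &&
				time.Since(c.LastTransitionTime.Time) >= 30*time.Second
		}
	}
	return false
}

// waitForStableOperatorsWatch uses UntilWithSync, which first replays every
// existing ClusterOperator as an event and then streams updates, so the logic
// gets watch semantics without a standalone controller. Caveat: a watch only
// fires on changes, so a real implementation also needs to re-check operators
// once their 30-second threshold elapses (related to the behavior debugged
// earlier in this thread).
func waitForStableOperatorsWatch(ctx context.Context, client configclient.Interface) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	// Seed the set of operators that still need to become stable.
	cos, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	pending := map[string]bool{}
	for _, co := range cos.Items {
		pending[co.Name] = true
	}

	lw := cache.NewListWatchFromClient(
		client.ConfigV1().RESTClient(), "clusteroperators", "", fields.Everything())
	_, err = watchtools.UntilWithSync(ctx, lw, &configv1.ClusterOperator{}, nil,
		func(event watch.Event) (bool, error) {
			co, ok := event.Object.(*configv1.ClusterOperator)
			if !ok {
				return false, nil
			}
			if isStable(co) {
				delete(pending, co.Name)
			} else {
				pending[co.Name] = true
			}
			return len(pending) == 0, nil
		})
	return err
}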

@patrickdillon
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 5, 2022
@patrickdillon
Contributor Author

patrickdillon commented Oct 6, 2022

The latest ovirt test shows that the most recent commit fixes the bug: kube-apiserver is not stable at the beginning of the check, but by the end of the 5 minutes it is stable:

time="2022-10-05T20:42:50Z" level=debug msg="Cluster is initialized"
time="2022-10-05T20:42:50Z" level=info msg="Waiting up to 5m0s (until 8:47PM) to ensure each cluster operator has finished progressing..."
time="2022-10-05T20:42:50Z" level=debug msg="Cluster Operator kube-scheduler is stable"
[...]
time="2022-10-05T20:42:54Z" level=debug msg="Cluster Operator dns is stable"
time="2022-10-05T20:42:54Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2022-10-05 20:42:26 +0000 UTC DurationSinceTransition=28s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 7"
time="2022-10-05T20:42:54Z" level=debug msg="Cluster Operator marketplace is stable"
time="2022-10-05T20:42:54Z" level=debug msg="Cluster Operator console is stable"
time="2022-10-05T20:42:54Z" level=debug msg="Cluster Operator image-registry is stable"
time="2022-10-05T20:47:50Z" level=debug msg="Cluster operator kube-apiserver is now stable: Progressing=False LastTransitionTime=2022-10-05 20:42:26 +0000 UTC DurationSinceTransition=324s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 7"

So the most recent code essentially imposes a 5-minute wait if the operators are not stable during the initial check. I had to look through four or five tests to find this case, where the initial stability check failed; in all the other cases the operators were stable, so there is no 5-minute penalty.

In this case, kube-apiserver missed by two seconds. If we see this sort of case frequently but are not catching real errors, we could shorten the 5-minute duration so the penalty is not as severe.

@patrickdillon
Contributor Author

/retest

Contributor

kube style is normally:

if statusErr == nil {
	return nil
}
if !errors.Is(statusErr, wait.ErrWaitTimeout) {
	// not a timeout, so surface the underlying error as-is
	return statusErr
}
if meetsStabilityThreshold {
	return nil
}
return statusErr

Contributor Author

Ok. I have adopted this suggestion.

I'm not sure if I'm trying to be too clever, but the use of a closure here requires the code to be a little different: I need to always return err, which may be nil. I have added a comment to try to clarify this. The rationale for using this technique was that it simplifies the calling code (which no longer needs to track whether an operator is unstable). But I almost forgot that was the point when refactoring coStabilityChecker, so I wonder whether this is a good idea...

In any case, should be ok now!
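For anyone else reading along, the closure idea described above looks roughly like this. This is a hedged sketch only: coStabilityChecker in the PR may be shaped differently, newStabilityChecker is a made-up name, and isStable is the same hypothetical predicate used in the earlier sketch.

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// newStabilityChecker returns a closure that remembers which operators are
// still unstable, so the caller only ever calls it and looks at the returned
// error, which is deliberately returned on every call and may be nil.
func newStabilityChecker() func(co *configv1.ClusterOperator) error {
	unstable := map[string]error{}
	return func(co *configv1.ClusterOperator) error {
		if isStable(co) {
			delete(unstable, co.Name)
		} else {
			unstable[co.Name] = fmt.Errorf(
				"cluster operator %s does not meet stability threshold", co.Name)
		}
		// nil means every operator seen so far is stable; otherwise surface one
		// of the outstanding errors. Either way the caller keeps no state.
		for _, err := range unstable {
			return err
		}
		return nil
	}
}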

@deads2k
Contributor

deads2k commented Oct 6, 2022

I like this change. I have minor style comments where it doesn't match kube; we inherited the installer, so I'm not sure of its standards.

Adds a check to see whether each cluster operator has stopped
progressing for at least 30 seconds. There is a five minute period
where operators can meet this threshold.

This check protects against a class of errors where operators report
themselves as Available=true but continue to progress and are not
fully functional.
@sdodson
Member

sdodson commented Oct 7, 2022

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2022
@patrickdillon
Contributor Author

/retest

@jhixson74
Member

Are we just waiting for tests to pass? Or is more discussion needed? This LGTM to me.

@patrickdillon
Contributor Author

/refresh

@patrickdillon
Contributor Author

Are we just waiting for tests to pass? Or is more discussion needed? This LGTM to me.

@jhixson74 I talked to @sdodson & @deads2k, and it looks like this is ready to go in. If you want to review or LGTM, that would be great. Thank you!

@deads2k
Contributor

deads2k commented Oct 17, 2022

Are we just waiting for tests to pass? Or is more discussion needed? This LGTM to me.

/lgtm

looks good to me too.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2022
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 045b952 and 2 for PR HEAD 088a068 in total

@jhixson74
Member

/lgtm
/approve

@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhixson74, sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2022

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-vsphere 4d652d9d30289f14d0e9f64a428c71f56da01273 link false /test okd-scos-e2e-vsphere
ci/prow/okd-scos-e2e-aws-ovn 088a068 link false /test okd-scos-e2e-aws-ovn
ci/prow/okd-scos-e2e-aws-upgrade 088a068 link false /test okd-scos-e2e-aws-upgrade
ci/prow/e2e-ibmcloud-ovn 088a068 link false /test e2e-ibmcloud-ovn
ci/prow/e2e-libvirt 088a068 link false /test e2e-libvirt
ci/prow/okd-e2e-aws-upgrade 088a068 link false /test okd-e2e-aws-upgrade
ci/prow/okd-scos-e2e-gcp 4d652d9d30289f14d0e9f64a428c71f56da01273 link false /test okd-scos-e2e-gcp
ci/prow/okd-scos-e2e-gcp-ovn-upgrade 4d652d9d30289f14d0e9f64a428c71f56da01273 link false /test okd-scos-e2e-gcp-ovn-upgrade
ci/prow/okd-e2e-aws-ovn 088a068 link false /test okd-e2e-aws-ovn

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sdodson
Member

sdodson commented Oct 18, 2022

/retest-required
repo errors are generally transient in nature
