-
Notifications
You must be signed in to change notification settings - Fork 1.5k
create: add check for cluster operator stability #6124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
create: add check for cluster operator stability #6124
Conversation
|
/cc @deads2k |
50021d7 to
b40bd1b
Compare
|
/hold |
b40bd1b to
a187bae
Compare
a187bae to
d064e4c
Compare
|
/retest |
|
Openstack e2e is failing stability check |
|
Added a debug statement (without testing 👼) in eadfe4e |
It seems like this operator was only checked once... will add some more debug statements to check this. |
eadfe4e to
98f2a32
Compare
|
/test all |
Ok: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/6124/pull-ci-openshift-installer-master-e2e-aws-ovn/1576888802176143360 has a bit better debugging visibility. It looks like etcd was correctly caught by the stability check (although I wonder if it just needed a little more time?) But I'm a little confused about whether a bug in my code is causing, in this case, the api server and kcm, to not be checked. I only see one check for those, while etcd has multiple frequent checks. |
98f2a32 to
75e409b
Compare
|
wrote patrickdillon#1 to demonstrate the replaying watch that this logic needs to stay clean and use a watch without writing an entire controller. |
3b09356 to
69d24e7
Compare
|
/hold cancel |
|
The most recent ovirt test shows that the most recent commit fixes the bug: kube-apiserver is not stable at the beginning of the check. At the end of the 5 minutes, it is stable: So the most recent code is essentially going to impose a 5 minute wait if the operators are not stable during the initial check. I had to look through four or five tests to find this case, where the initial stability check failed. In all of those cases, the operators are stable and there is no 5 minute penalty. In this case, kube api-server missed by two seconds. If we see this sort of case frequently but we are not catching errors, we could drop the 5 minute duration, so the penalty is not as severe. |
69d24e7 to
4d652d9
Compare
|
/retest |
cmd/openshift-install/create.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kube style is normally
if statusErr == nil{
return nil
}
if !errors.Is(waittimeout){
}
if meetsStabilityThreshold{
return nil
}
return errThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I have adopted this suggestion.
I'm not sure if I'm trying to be too clever, but the use of a closure here requires the code to be a little different. I need to always return err, which may be nil. I have added a comment to try to clarify this. The rationale for using this technique was that it would simplify the calling code (because it would no longer need to track whether an operator was unstable). But I almost forgot this was the point when refactoring coStabilityChecker, so I wonder if this is a good idea...
In any case, should be ok now!
|
I like this change. Minor style comments that don't match kube. We inherited the installer, so I'm not sure of its standards. |
Adds a check to see whether each cluster operator has stopped progressing for at least 30 seconds. There is a five minute period where operators can meet this threshold. This check prevents against a class of errors where operators report themselves as Available=true but continue to progress and are not fully functional.
4d652d9 to
088a068
Compare
|
/approve |
|
/retest |
|
Are we just waiting for tests to pass? Or is more discussion needed? This LGTM to me. |
|
/refresh |
@jhixson74 I talked to @sdodson & @deads2k and it looks like this is ready to go in, if you want to review or LGTM that would be great. Thank you! |
/lgtm looks good to me too. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jhixson74, sdodson The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@patrickdillon: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/retest-required |
Following up from Cluster Lifecycle call in late February, this is a PoC implementation of having the installer check cluster operators for stability. There is openshift/enhancements#1189 for discussion.
Output on a successful run looks like this.
Note, I would like to get ride of the client-side throttling output, but not sure how to do that.
Input is extremely welcome as this is not my normal coding domain.