create: add check for cluster operator stability #7289
Conversation
|
/hold |
|
Good catch. /lgtm |
|
Update: squashed commits, no code changes. |
|
/lgtm |
|
/payload 4.14 nightly blocking |
|
@r4f4: trigger 7 job(s) of type blocking for the nightly release of OCP 4.14
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5afcbf10-1b6e-11ee-8392-52304d0d3916-0 |
|
/payload 4.14 nightly informing |
|
@deads2k: trigger 52 job(s) of type informing for the nightly release of OCP 4.14
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1d4145b0-1b73-11ee-99b0-81946b5f13c6-0 |
|
/test e2e-gcp-ovn e2e-azure-ovn e2e-vsphere-ovn-upi |
|
/payload 4.14 nightly blocking |
|
@sdodson: trigger 7 job(s) of type blocking for the nightly release of OCP 4.14
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4acddbb0-1b9b-11ee-8e6e-f208b55cc3b4-0 |
|
There were a number of odd, test-infra-looking failures in the last run. |
|
/test e2e-metal-ipi-ovn-ipv6 e2e-vsphere-ovn e2e-azurestack |
|
/test e2e-vsphere-static-ovn |
|
This run failed on the wait for apparently legitimate reasons: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/7289/pull-ci-openshift-installer-master-e2e-vsphere-ovn/1676911592173735936 (a node never came up). That failure is correct and isn't a problem with this PR per se, but extending the timeout will make the "it never worked" case clearer. Suggested 30 minutes above. |
Adds a check that each cluster operator has stopped progressing for at least 30 seconds. Operators are given a five-minute window in which to meet this threshold. This check protects against a class of errors where operators report themselves as Available=true but continue to progress and are not fully functional.
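A minimal sketch of that check, assuming the OpenShift config clientset; the names `coStabilityThreshold` and `operatorStable` are illustrative, not necessarily the PR's exact identifiers:

```go
package cosketch

import (
	"context"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// coStabilityThreshold is how long an operator must remain Progressing=False
// before it counts as stable.
const coStabilityThreshold = 30 * time.Second

// operatorStable reports whether the named cluster operator has been
// Progressing=False for at least coStabilityThreshold. Callers poll this
// inside a bounded (e.g. five-minute) wait.
func operatorStable(ctx context.Context, cc *configclient.Clientset, name string) (bool, error) {
	co, err := cc.ConfigV1().ClusterOperators().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, cond := range co.Status.Conditions {
		if cond.Type != configv1.OperatorProgressing {
			continue
		}
		stableFor := time.Since(cond.LastTransitionTime.Time)
		return cond.Status == configv1.ConditionFalse && stableFor >= coStabilityThreshold, nil
	}
	// No Progressing condition reported yet: treat the operator as not stable.
	return false, nil
}
```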
…ability has passed
|
/test e2e-vsphere-ovn |
|
LGTM |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: patrickdillon. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
}

func getClusterOperatorNames(ctx context.Context, cc *configclient.Clientset) ([]string, error) {
	listCtx, cancel := context.WithTimeout(ctx, 1*time.Minute)
I don't understand this 1m timeout. It seems unlikely that the list takes longer than that, but if it does, and the wrapping ctx has more time available, wouldn't we want to wait longer?
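For illustration only (not the PR's code), a small snippet of the concern above: the inner `WithTimeout` caps the time available to the List call even when the wrapping context's deadline is further away.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Pretend this is the wrapping ctx passed into getClusterOperatorNames,
	// with ten minutes still available.
	parent, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// The inner timeout shortens the effective deadline to ~1 minute.
	listCtx, listCancel := context.WithTimeout(parent, 1*time.Minute)
	defer listCancel()

	deadline, _ := listCtx.Deadline()
	fmt.Println("time left on listCtx:", time.Until(deadline).Round(time.Second)) // ~1m0s, not ~10m0s
}
```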
		logrus.Debugf("Cluster Operator %s is stable", name)
		return true, nil
	}
	logrus.Debugf("Cluster Operator %s is Progressing=%s LastTransitionTime=%v DurationSinceTransition=%.fs Reason=%s Message=%s", name, progressing.Status, progressing.LastTransitionTime.Time, time.Since(progressing.LastTransitionTime.Time).Seconds(), progressing.Reason, progressing.Message)
Install logs from the vSphere OVN run:
time="2023-07-06T17:51:56Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=0s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:56Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=0s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:57Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=1s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:58Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=2s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:59Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=3s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
Is this something we want to log each second? Especially when we hit the Progressing=False target and are just waiting out the clock on coStabilityThreshold? Maybe we could log once when it transitions to Progressing=False, and not until it changes after that? On the other hand, it's debug-level logs, so maybe not worth the effort to denoise?
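A rough sketch of the "log only on transitions" idea floated above; `lastLogged` and `logProgressingTransition` are hypothetical names, not part of the PR:

```go
package cosketch

import (
	configv1 "github.com/openshift/api/config/v1"
	"github.com/sirupsen/logrus"
)

// lastLogged remembers the last Progressing status logged per operator.
var lastLogged = map[string]configv1.ConditionStatus{}

// logProgressingTransition emits a debug line only when the operator's
// Progressing status changes, instead of once per second while waiting
// out the clock on coStabilityThreshold.
func logProgressingTransition(name string, progressing configv1.ClusterOperatorStatusCondition) {
	if prev, ok := lastLogged[name]; ok && prev == progressing.Status {
		return // status unchanged since the last poll; stay quiet
	}
	lastLogged[name] = progressing.Status
	logrus.Debugf("Cluster Operator %s is now Progressing=%s (Reason=%s Message=%s)",
		name, progressing.Status, progressing.Reason, progressing.Message)
}
```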
|
/lgtm |
|
/hold I'm suspicious of the results in the vSphere test. The round number gives me pause:
For some reason, in this run, it looks like the authentication operator is stable, but we are waiting quite a while (until the deadline?) to register that stability. Edit: I haven't had time to dive into the code, but looking at the logs, it seems the apiserver, which is |
`logrus.Exit` calls `os.Exit`, so let's just return the error during the operators check and exit at the end of the wait for all operators; that way we get information about all of them instead of exiting at the first error.
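A hedged sketch of that approach: collect per-operator errors and aggregate them at the end of the wait instead of exiting on the first failure. `aggregateOperatorErrors` is an illustrative name, and the per-operator check is passed in as a callback.

```go
package cosketch

import (
	"context"
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// aggregateOperatorErrors runs the check for every cluster operator and
// returns a single aggregated error at the end, so the caller gets details
// about all unstable operators instead of exiting at the first one.
func aggregateOperatorErrors(ctx context.Context, names []string, check func(context.Context, string) error) error {
	var errs []error
	for _, name := range names {
		if err := check(ctx, name); err != nil {
			errs = append(errs, fmt.Errorf("cluster operator %s: %w", name, err))
		}
	}
	// NewAggregate returns nil when errs is empty; the caller decides
	// whether (and how) to exit.
	return utilerrors.NewAggregate(errs)
}
```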
|
New changes are detected. LGTM label has been removed. |
|
/test e2e-vsphere-ovn |
|
@r4f4: The following tests failed, say
Full PR test history. Your PR dashboard. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
Should be fixed now. |
|
/lgtm |
|
/skip |
…settle-ops"" This reverts commit 017b4f0.
Revert "Merge pull request #7289 from r4f4/padillon-settle-ops"
This reverts commit 017b4f0.
Adds a check that each cluster operator has stopped progressing for at
least 30 seconds. Operators are given a five-minute window in which to
meet this threshold.
This check protects against a class of errors where operators report
themselves as Available=true but continue to progress and are not
fully functional.