create: add check for cluster operator stability #7289
Conversation
|
/hold |
|
Good catch. /lgtm |
|
Update: squashed commits, no code changes. |
|
/lgtm |
|
/payload 4.14 nightly blocking |
|
@r4f4: trigger 7 job(s) of type blocking for the nightly release of OCP 4.14
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5afcbf10-1b6e-11ee-8392-52304d0d3916-0 |
|
/payload 4.14 nightly informing |
|
@deads2k: trigger 52 job(s) of type informing for the nightly release of OCP 4.14
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1d4145b0-1b73-11ee-99b0-81946b5f13c6-0 |
|
/test e2e-gcp-ovn e2e-azure-ovn e2e-vsphere-ovn-upi |
|
/payload 4.14 nightly blocking |
|
@sdodson: trigger 7 job(s) of type blocking for the nightly release of OCP 4.14
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4acddbb0-1b9b-11ee-8e6e-f208b55cc3b4-0 |
|
There were a number of odd, test-infra-looking failures in the last run. |
|
/test e2e-metal-ipi-ovn-ipv6 e2e-vsphere-ovn e2e-azurestack |
|
/test e2e-vsphere-static-ovn |
|
This run failed on the wait for apparently legitimate reasons: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/7289/pull-ci-openshift-installer-master-e2e-vsphere-ovn/1676911592173735936 (a node never came up). That failure is correct and isn't a problem with this PR per se, but extending the timeout will make the "it never worked" case clearer. Suggested 30 minutes above. |
Adds a check that each cluster operator has stopped progressing for at least 30 seconds. Operators are given a five-minute window in which to meet this threshold. This check protects against a class of errors where operators report themselves as Available=true but continue to progress and are not fully functional.
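A minimal sketch of that check, assuming the OpenShift config clientset; the names `coStabilityThreshold` and `operatorStable` are illustrative, not necessarily the PR's exact identifiers:

```go
package cosketch

import (
	"context"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// coStabilityThreshold is how long an operator must remain Progressing=False
// before it counts as stable.
const coStabilityThreshold = 30 * time.Second

// operatorStable reports whether the named cluster operator has been
// Progressing=False for at least coStabilityThreshold. Callers poll this
// inside a bounded (e.g. five-minute) wait.
func operatorStable(ctx context.Context, cc *configclient.Clientset, name string) (bool, error) {
	co, err := cc.ConfigV1().ClusterOperators().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, cond := range co.Status.Conditions {
		if cond.Type != configv1.OperatorProgressing {
			continue
		}
		stableFor := time.Since(cond.LastTransitionTime.Time)
		return cond.Status == configv1.ConditionFalse && stableFor >= coStabilityThreshold, nil
	}
	// No Progressing condition reported yet: treat the operator as not stable.
	return false, nil
}
```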
…ability has passed
|
/test e2e-vsphere-ovn |
|
LGTM |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: patrickdillon. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
}

func getClusterOperatorNames(ctx context.Context, cc *configclient.Clientset) ([]string, error) {
	listCtx, cancel := context.WithTimeout(ctx, 1*time.Minute)
I don't understand this 1m timeout. It seems unlikely that the list takes longer than that, but if it does, and the wrapping ctx has more time available, wouldn't we want to wait longer?
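For illustration only (not the PR's code), a small snippet of the concern above: the inner `WithTimeout` caps the time available to the List call even when the wrapping context's deadline is further away.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Pretend this is the wrapping ctx passed into getClusterOperatorNames,
	// with ten minutes still available.
	parent, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// The inner timeout shortens the effective deadline to ~1 minute.
	listCtx, listCancel := context.WithTimeout(parent, 1*time.Minute)
	defer listCancel()

	deadline, _ := listCtx.Deadline()
	fmt.Println("time left on listCtx:", time.Until(deadline).Round(time.Second)) // ~1m0s, not ~10m0s
}
```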
		logrus.Debugf("Cluster Operator %s is stable", name)
		return true, nil
	}
	logrus.Debugf("Cluster Operator %s is Progressing=%s LastTransitionTime=%v DurationSinceTransition=%.fs Reason=%s Message=%s", name, progressing.Status, progressing.LastTransitionTime.Time, time.Since(progressing.LastTransitionTime.Time).Seconds(), progressing.Reason, progressing.Message)
Install logs from the vSphere OVN run:
time="2023-07-06T17:51:56Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=0s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:56Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=0s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:57Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=1s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:58Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=2s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:59Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=3s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
Is this something we want to log each second? Especially when we hit the Progressing=False target and are just waiting out the clock on coStabilityThreshold? Maybe we could log once when it transitions to Progressing=False, and not until it changes after that? On the other hand, it's debug-level logs, so maybe not worth the effort to denoise?
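A rough sketch of the "log only on transitions" idea floated above; `lastLogged` and `logProgressingTransition` are hypothetical names, not part of the PR:

```go
package cosketch

import (
	configv1 "github.com/openshift/api/config/v1"
	"github.com/sirupsen/logrus"
)

// lastLogged remembers the last Progressing status logged per operator.
var lastLogged = map[string]configv1.ConditionStatus{}

// logProgressingTransition emits a debug line only when the operator's
// Progressing status changes, instead of once per second while waiting
// out the clock on coStabilityThreshold.
func logProgressingTransition(name string, progressing configv1.ClusterOperatorStatusCondition) {
	if prev, ok := lastLogged[name]; ok && prev == progressing.Status {
		return // status unchanged since the last poll; stay quiet
	}
	lastLogged[name] = progressing.Status
	logrus.Debugf("Cluster Operator %s is now Progressing=%s (Reason=%s Message=%s)",
		name, progressing.Status, progressing.Reason, progressing.Message)
}
```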
|
/lgtm |
|
/hold I'm suspicious of the results in the vSphere test. The round number gives me pause:
For some reason, in this run, it looks like the authentication operator is stable, but we are waiting quite a while (until the deadline?) to register that stability. Edit: I haven't had time to dive into the code, but looking at the logs, it seems the apiserver, which is |
`logrus.Exit` calls `os.Exit`, so let's just return the error during the operators check and exit at the end of the wait for all operators; that way we get information about all of them instead of exiting at the first error.
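A hedged sketch of that approach: collect per-operator errors and aggregate them at the end of the wait instead of exiting on the first failure. `aggregateOperatorErrors` is an illustrative name, and the per-operator check is passed in as a callback.

```go
package cosketch

import (
	"context"
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// aggregateOperatorErrors runs the check for every cluster operator and
// returns a single aggregated error at the end, so the caller gets details
// about all unstable operators instead of exiting at the first one.
func aggregateOperatorErrors(ctx context.Context, names []string, check func(context.Context, string) error) error {
	var errs []error
	for _, name := range names {
		if err := check(ctx, name); err != nil {
			errs = append(errs, fmt.Errorf("cluster operator %s: %w", name, err))
		}
	}
	// NewAggregate returns nil when errs is empty; the caller decides
	// whether (and how) to exit.
	return utilerrors.NewAggregate(errs)
}
```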
|
New changes are detected. LGTM label has been removed. |
|
/test e2e-vsphere-ovn |
|
@r4f4: The following tests failed, say
Full PR test history. Your PR dashboard. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
Should be fixed now. |
|
/lgtm |
|
/skip |
…settle-ops"" This reverts commit 017b4f0.
Revert "Merge pull request #7289 from r4f4/padillon-settle-ops"
This reverts commit 017b4f0.
Adds a check that each cluster operator has stopped progressing for at
least 30 seconds. Operators are given a five-minute window in which to
meet this threshold.
This check protects against a class of errors where operators report
themselves as Available=true but continue to progress and are not
fully functional.