Skip to content

Conversation

@r4f4
Copy link
Contributor

@r4f4 r4f4 commented Jun 29, 2023

Adds a check to see whether each cluster operator has stopped
progressing for at least 30 seconds. There is a five minute period
where operators can meet this threshold.

This check prevents against a class of errors where operators report
themselves as Available=true but continue to progress and are not
fully functional.

@openshift-ci openshift-ci bot requested review from rna-afk and sadasu June 29, 2023 21:44
@r4f4
Copy link
Contributor Author

r4f4 commented Jun 29, 2023

/hold
while we check how this fares with some CI runs

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 29, 2023
@deads2k
Copy link
Contributor

deads2k commented Jul 5, 2023

Good catch.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2023
@r4f4 r4f4 force-pushed the padillon-settle-ops branch from 70c62bd to da4a6c7 Compare July 5, 2023 13:29
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2023
@r4f4
Copy link
Contributor Author

r4f4 commented Jul 5, 2023

Update: squashed commits, no code changes.
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 5, 2023
@deads2k
Copy link
Contributor

deads2k commented Jul 5, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2023
@r4f4
Copy link
Contributor Author

r4f4 commented Jul 5, 2023

/payload 4.14 nightly blocking

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 5, 2023

@r4f4: trigger 7 job(s) of type blocking for the nightly release of OCP 4.14

  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.14-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5afcbf10-1b6e-11ee-8392-52304d0d3916-0

@deads2k
Copy link
Contributor

deads2k commented Jul 5, 2023

/payload 4.14 nightly informing

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 5, 2023

@deads2k: trigger 52 job(s) of type informing for the nightly release of OCP 4.14

  • periodic-ci-openshift-release-master-nightly-4.14-console-aws
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-csi
  • periodic-ci-openshift-release-master-ci-4.14-e2e-aws-ovn
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-fips
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-single-node
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-single-node-serial
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-cgroupsv2
  • periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-techpreview
  • periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-upi
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-azure-csi
  • periodic-ci-openshift-release-master-ci-4.14-e2e-azure-ovn
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-azure-sdn
  • periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-techpreview
  • periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-techpreview-serial
  • periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-azure-deploy-cnv
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-azure-upgrade-cnv
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-driver-toolkit
  • periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-ovn-csi
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-ovn-rt
  • periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-sdn
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-sdn-serial
  • periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-sdn-techpreview
  • periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-sdn-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-sdn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-ovn-dualstack
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-metal-ipi-sdn-bm-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-serial-ipv4
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-serial-virtualmedia-bond
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-serial-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-serial-ovn-dualstack
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-metal-ipi-upgrade-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ovn-assisted
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-proxy
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ovn-single-node-live-iso
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-telco5g
  • periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-vsphere-ovn-csi
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-vsphere-ovn-serial
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-vsphere-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-vsphere-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-vsphere-ovn-upi
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-vsphere-ovn-upi-serial
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-vsphere-sdn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1d4145b0-1b73-11ee-99b0-81946b5f13c6-0

@r4f4
Copy link
Contributor Author

r4f4 commented Jul 5, 2023

/test e2e-gcp-ovn e2e-azure-ovn e2e-vsphere-ovn-upi

@sdodson
Copy link
Member

sdodson commented Jul 6, 2023

/payload 4.14 nightly blocking

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 6, 2023

@sdodson: trigger 7 job(s) of type blocking for the nightly release of OCP 4.14

  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.14-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4acddbb0-1b9b-11ee-8e6e-f208b55cc3b4-0

@sdodson
Copy link
Member

sdodson commented Jul 6, 2023

There were a number of odd test infra looking failures in the last run.

@r4f4
Copy link
Contributor Author

r4f4 commented Jul 6, 2023

/test e2e-metal-ipi-ovn-ipv6 e2e-vsphere-ovn e2e-azurestack

@r4f4
Copy link
Contributor Author

r4f4 commented Jul 6, 2023

/test e2e-vsphere-static-ovn
/test e2e-vsphere-upi-zones
/test e2e-vsphere-zones
/test e2e-vsphere-upi

@deads2k
Copy link
Contributor

deads2k commented Jul 6, 2023

This run failed on the wait for apparently legitimate reasons: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/7289/pull-ci-openshift-installer-master-e2e-vsphere-ovn/1676911592173735936

a node never came up. That failure is correct and isn't a problem with this PR per-se, but extending the timeout will make the "it never worked" more clear. Suggested 30 minutes above.

patrickdillon and others added 2 commits July 6, 2023 17:04
Adds a check to see whether each cluster operator has stopped
progressing for at least 30 seconds. There is a five minute period
where operators can meet this threshold.

This check prevents against a class of errors where operators report
themselves as Available=true but continue to progress and are not
fully functional.
@r4f4 r4f4 force-pushed the padillon-settle-ops branch from da4a6c7 to 33c6ea3 Compare July 6, 2023 15:04
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 6, 2023
@sdodson
Copy link
Member

sdodson commented Jul 6, 2023

/test e2e-vsphere-ovn

@patrickdillon
Copy link
Contributor

LGTM
/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 6, 2023
}

func getClusterOperatorNames(ctx context.Context, cc *configclient.Clientset) ([]string, error) {
listCtx, cancel := context.WithTimeout(ctx, 1*time.Minute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this 1m timeout. It seems unlikely that the list takes longer than that, but if it does, and the wrapping ctx has more time available, wouldn't we want to wait longer?

logrus.Debugf("Cluster Operator %s is stable", name)
return true, nil
}
logrus.Debugf("Cluster Operator %s is Progressing=%s LastTransitionTime=%v DurationSinceTransition=%.fs Reason=%s Message=%s", name, progressing.Status, progressing.LastTransitionTime.Time, time.Since(progressing.LastTransitionTime.Time).Seconds(), progressing.Reason, progressing.Message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Install logs from the vSphere OVN run:

time="2023-07-06T17:51:56Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=0s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:56Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=0s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:57Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=1s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:58Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=2s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"
time="2023-07-06T17:51:59Z" level=debug msg="Cluster Operator kube-apiserver is Progressing=False LastTransitionTime=2023-07-06 17:51:56 +0000 UTC DurationSinceTransition=3s Reason=AsExpected Message=NodeInstallerProgressing: 3 nodes are at revision 6"

Is this something we want to log each second? Especially when we hit the Progressing=False target and are just waiting out the clock on coStabilityThreshold? Maybe we could log once when it transitions to Progressing=False, and not until it changes after that? On the other hand, it's debug-level logs, so maybe not worth the effort to denoise?

@deads2k
Copy link
Contributor

deads2k commented Jul 6, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 6, 2023
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 30838d9 and 2 for PR HEAD 33c6ea3 in total

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD a58142c and 1 for PR HEAD 33c6ea3 in total

@patrickdillon
Copy link
Contributor

patrickdillon commented Jul 7, 2023

/hold

I'm suspicious of the results in the vsphere test

The round number gives me pause:

time="2023-07-07T05:50:15Z" level=debug msg=" Cluster Operators Stable: 30m0s"

For some reason, in this run, it looks like the authentication operator is stable, but we are waiting quite a while (until the deadline?) to register that stability

time="2023-07-07T05:20:15Z" level=debug msg="Cluster Operator authentication is Progressing=False LastTransitionTime=2023-07-07 05:19:51 +0000 UTC DurationSinceTransition=25s Reason=AsExpected Message=AuthenticatorCertKeyProgressing: All is well"
[...]
time="2023-07-07T05:21:04Z" level=debug msg="Cluster Operator kube-apiserver is stable"
time="2023-07-07T05:50:15Z" level=debug msg="Cluster operator authentication is now stable: Progressing=False LastTransitionTime=2023-07-07 05:19:51 +0000 UTC DurationSinceTransition=1825s Reason=AsExpected Message=AuthenticatorCertKeyProgressing: All is well"
time="2023-07-07T05:50:15Z" level=info msg="All cluster operators have completed progressing"

edit: I haven't had time to dive into the code, but looking at the logs, it seems the apiserver, which is Progressing=True on the initial check is treated differently than the authentication operator which is Progressing=False (but less than the 30s threshold)

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 7, 2023
`logrus.Exit` calls `os.Exit` so let's just return the error during the
operators check and exit at the end of the wait for all operators so we
can get information about all of them instead of exitting at the first
error.
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 7, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 7, 2023

New changes are detected. LGTM label has been removed.

@r4f4
Copy link
Contributor Author

r4f4 commented Jul 7, 2023

/test e2e-vsphere-ovn

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 7, 2023

@r4f4: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn fb52034 link false /test okd-scos-e2e-aws-ovn
ci/prow/okd-e2e-aws-ovn fb52034 link false /test okd-e2e-aws-ovn
ci/prow/okd-e2e-aws-ovn-upgrade fb52034 link false /test okd-e2e-aws-ovn-upgrade

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@r4f4
Copy link
Contributor Author

r4f4 commented Jul 7, 2023

Should be fixed now.
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 7, 2023
@deads2k
Copy link
Contributor

deads2k commented Jul 10, 2023

/lgtm

@r4f4
Copy link
Contributor Author

r4f4 commented Jul 10, 2023

/skip

@deads2k deads2k added the lgtm Indicates that a PR is ready to be merged. label Jul 10, 2023
@openshift-merge-robot openshift-merge-robot merged commit 3bc9b24 into openshift:master Jul 11, 2023
r4f4 added a commit to r4f4/installer that referenced this pull request Jul 12, 2023
This reverts commit 3bc9b24, reversing
changes made to a4604b0.

This change is sometimes failing with
```
 E0712 09:14:15.239148    4409 runtime.go:79] Observed a panic: "send on closed channel" (send on closed channel)
```

Let's revert while we investigate.
deads2k added a commit to deads2k/installer that referenced this pull request Jul 12, 2023
patrickdillon added a commit that referenced this pull request Jul 12, 2023
Revert "Merge pull request #7289 from r4f4/padillon-settle-ops"
deads2k added a commit to deads2k/installer that referenced this pull request Sep 22, 2023
@deads2k deads2k mentioned this pull request Sep 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants