@wking wking commented Sep 28, 2021

The 4.9 to 4.10 to 4.9 rollbacks keep hitting the 3h timeout:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&name=4.10-upgrade-from-stable-4.9&search=Process+did+not+finish+before.*timeout' | grep 'rollback.*failures match'
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&type=junit&context=0&name=4.10-upgrade-from-stable-4.9&search=Process+did+not+finish+before.*timeout' | jq -r 'to_entries[].value | to_entries[].value[] | .name + " " + .context[0]' | sed -n 's/\(.*rollback\) .*before \([^ ]*\) timeout.*/\1 \2/p'
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback 3h0m0s
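
To see what the sed step in the second command keeps, here is a minimal offline sketch. The function name and both input lines are made up for illustration; only the sed expression is copied from the command above.

```shell
# Same sed expression as in the pipeline above, wrapped in a function so it
# can be tried on local input instead of live search results.
extract_rollback_timeouts() {
  sed -n 's/\(.*rollback\) .*before \([^ ]*\) timeout.*/\1 \2/p'
}

# Hypothetical input: a job name, a space, then one line of matching context.
# Only the first line names a rollback job, so only that line is printed.
printf '%s\n' \
  'example-e2e-aws-upgrade-rollback Process did not finish before 3h0m0s timeout' \
  'example-e2e-aws-upgrade Process did not finish before 3h0m0s timeout' \
  | extract_rollback_timeouts
# prints: example-e2e-aws-upgrade-rollback 3h0m0s
```

The first capture group is the job name up to "rollback" and the second is the duration that precedes "timeout", which is why non-rollback jobs drop out of the listing.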

We've had the 3h timeout since 90fadfe (#14674). We still need some time for setup and teardown in the wrapping Prow job, so I'm also setting decoration_config to raise that limit for these jobs. That leaves us exposed when other jobs that use this same step hang and spend so long in the step that the wrapping Plank/Prow timeout leaves them too little time to finish their teardown/gather. But if that happens, maybe the test-platform folks will give us either a way to override a single step's timeout for a job, or a blanket increase in the Plank/Prow cap.
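
A rough way to sanity-check that budget is to subtract the step timeout from the job-level cap and see what is left for setup and teardown/gather. This is only a sketch: the helper names and the 4h/3h30m values are hypothetical, not the values this PR sets.

```shell
# Convert a Go-style duration such as 3h0m0s (the form shown in the search
# output) to seconds. Handles only the h, m and s units.
duration_to_seconds() {
  printf '%s\n' "$1" | awk '{
    total = 0
    s = $0
    while (match(s, /^[0-9]+[hms]/)) {
      n = substr(s, 1, RLENGTH - 1) + 0
      u = substr(s, RLENGTH, 1)
      total += n * (u == "h" ? 3600 : (u == "m" ? 60 : 1))
      s = substr(s, RLENGTH + 1)
    }
    print total
  }'
}

# Seconds left for setup and teardown/gather if one step uses its full timeout.
# Arguments: job-level cap, then step timeout.
headroom_seconds() {
  job=$(duration_to_seconds "$1")
  step=$(duration_to_seconds "$2")
  echo $((job - step))
}

# Hypothetical values: a 4h job-level cap around a 3h30m test step.
headroom_seconds 4h0m0s 3h30m0s   # prints 1800
```

If the result is small or negative, a step that runs to its full timeout leaves the wrapping job too little time to finish teardown/gather.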

@wking wking force-pushed the bump-4.9-to-4.10-rollback-timeout branch 4 times, most recently from a12356c to 2717192 Compare November 3, 2021 04:40

@vrutkovs vrutkovs left a comment

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm and approved labels Nov 18, 2021

openshift-ci bot commented Nov 18, 2021

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/rehearse/openshift/vmware-vsphere-csi-driver/release-4.9/e2e-vsphere | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/vmware-vsphere-csi-driver/release-4.9/e2e-vsphere-csi | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cluster-network-operator/release-4.9/e2e-azure-ovn-dualstack | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/origin/release-4.5/e2e-aws-csi | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/origin/release-4.1/e2e-aws-image-ecosystem | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/origin/release-4.1/e2e-aws-builds | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/origin/release-4.9/e2e-gcp-image-ecosystem | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cluster-logging-operator/tech-preview/e2e-operator | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/origin/release-4.2/e2e-cmd | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/machine-config-operator/release-4.9/e2e-gcp-single-node | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/kubernetes/release-4.9/e2e-openstack-csi-manila | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/origin/release-4.6/e2e-agnostic-cmd | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/origin/release-4.9/e2e-gcp-disruptive | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cluster-cloud-controller-manager-operator/release-4.9/e2e-openstack-ccm | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/installer/release-4.9/e2e-azure-resourcegroup | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/origin/release-4.9/e2e-gcp-builds | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cluster-kube-controller-manager-operator/release-4.10/e2e-aws-ccm | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cluster-network-operator/release-4.9/e2e-ovn-ipsec-step-registry | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/origin/release-4.9/e2e-aws-disruptive | 83beaa060cfd3eeecfb749a455c6148f5c08c1c7 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-ovirt-upgrade | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-ovirt-upgrade | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-cluster-api-provider-kubevirt-release-4.9-sanity-ovn | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/ovirt-csi-driver/release-4.7/e2e-ovirt | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/router/release-4.9/e2e-agnostic | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/redhat-developer/jenkins-operator/main/e2e | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-techpreview | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-ovn | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cloud-credential-operator/release-4.9/e2e-aws-manual-oidc | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/openstack-cinder-csi-driver-operator/release-4.9/e2e-openstack-csi | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-techpreview | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-calico | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/aws-efs-csi-driver-operator/release-4.9/operator-e2e | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-techpreview-serial | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-techpreview-serial | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-cilium | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/operator-framework/operator-marketplace/release-4.9/e2e-aws-upgrade | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cluster-cloud-controller-manager-operator/release-4.9/e2e-azure-ccm-install | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cluster-etcd-operator/release-4.9/e2e-gcp-disruptive-ovn | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cluster-cloud-controller-manager-operator/release-4.9/e2e-azure-ccm | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/rehearse/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback | 271719242bf1ca07e1752cef086c686299f72358 | link | unknown | /test pj-rehearse |
| ci/prow/pj-rehearse | 271719242bf1ca07e1752cef086c686299f72358 | link | false | /test pj-rehearse |

@vrutkovs

/lgtm cancel
Needs make jobs

@openshift-ci openshift-ci bot removed the lgtm label Nov 18, 2021
The 4.9 to 4.10 to 4.9 rollbacks keep hitting the 3h timeout:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&name=4.10-upgrade-from-stable-4.9&search=Process+did+not+finish+before.*timeout' | grep 'rollback.*failures match'
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
  $ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&type=junit&context=0&name=4.10-upgrade-from-stable-4.9&search=Process+did+not+finish+before.*timeout' | jq -r 'to_entries[].value | to_entries[].value[] | .name + " " + .context[0]' | sed -n 's/\(.*rollback\) .*before \([^ ]*\) timeout.*/\1 \2/p'
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback 3h0m0s

We've had the 3h timeout since 90fadfe (steps/openshift-e2e-test:
e2e tests can take longer than 2h, 2021-01-06, openshift#14674).  We still need
some time for setup and teardown in the wrapping Prow job [1], so I'm
also setting timeout on the jobs, via [2].  That leaves us exposed
when other jobs that use this same step hang and spend so long in the
step that the wrapping Plank/Prow timeout leaves them too little time
to finish their teardown/gather.  But if that happens, maybe the
test-platform folks will give us either a way to override a single
step's timeout for a job, or a blanket increase in the Plank/Prow cap.

[1]: https://docs.ci.openshift.org/docs/architecture/timeouts/
[2]: openshift/ci-tools#2294
@wking wking force-pushed the bump-4.9-to-4.10-rollback-timeout branch from 2717192 to c1c1989 Compare November 18, 2021 17:19

@vrutkovs vrutkovs left a comment

/lgtm

openshift-ci bot commented Nov 18, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrutkovs, wking

@openshift-ci openshift-ci bot added the lgtm label Nov 18, 2021
@openshift-merge-robot openshift-merge-robot merged commit 49025ee into openshift:master Nov 18, 2021

openshift-ci bot commented Nov 18, 2021

@wking: Updated the following 3 configmaps:

  • ci-operator-master-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-master__ci-4.10-upgrade-from-stable-4.9.yaml using file ci-operator/config/openshift/release/openshift-release-master__ci-4.10-upgrade-from-stable-4.9.yaml
  • job-config-master-periodics configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-master-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml
  • step-registry configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-e2e-test-ref.yaml using file ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml

@wking wking deleted the bump-4.9-to-4.10-rollback-timeout branch November 25, 2021 00:04