Skip to content

Conversation

@vrutkovs
Copy link
Contributor

@vrutkovs vrutkovs commented Aug 25, 2021

Add "Timeout" param to the test, which would reset decoration_config.timeout so that some tests could run longer than 4 hours.

Example job: openshift/release#21355

TODO:

  • Add a limit to the timeouts - perhaps 24h would be sufficient for now?

Ref: https://issues.redhat.com/browse/DPTP-889

@wking
Copy link
Member

wking commented Aug 25, 2021

Can you use decoration_config on the jobs, like openshift/release#19638 ?

@vrutkovs
Copy link
Contributor Author

Can you use decoration_config on the jobs

In the end the timeout needs to be set, yes, but either:

  • prowgen should set it from step.timeout or
  • prowgen should ignore changes in decoration_config.timeout

Not sure which option is preferable, this PR implements option 1

@petr-muller
Copy link
Member

I think this is fine if it has some sane enforced upper limit ('d be conservative and went for 8h).

@openshift-ci openshift-ci bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 26, 2021
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 4, 2021
@vrutkovs vrutkovs force-pushed the override-job-timeout branch from 849b869 to 3bd7692 Compare September 6, 2021 09:51
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 6, 2021
@vrutkovs vrutkovs force-pushed the override-job-timeout branch from 3bd7692 to 69ea12b Compare September 6, 2021 11:44
@vrutkovs
Copy link
Contributor Author

vrutkovs commented Sep 6, 2021

Limited job time to 8hrs

Copy link
Member

@petr-muller petr-muller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll also need a small doc followup for https://docs.ci.openshift.org/docs/architecture/timeouts/ (does not block merge)

maxCustomDuration := time.Hour * 8
if timeout != nil && timeout.Duration <= maxCustomDuration {
decorationConfig.Timeout = timeout
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should actively refuse this in ci-op config validation (https://github.com/openshift/ci-tools/blob/master/pkg/validation/test.go#L116) with some sensible error message instead of capping it silently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, implemented

Add "Timeout" param to the test, which would reset `decoration_config.timeout` so that some tests could run longer than 4 hours
@vrutkovs vrutkovs force-pushed the override-job-timeout branch from 69ea12b to e2369d4 Compare September 8, 2021 10:46
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 9, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 9, 2021
@openshift-merge-robot openshift-merge-robot merged commit 78d1e5d into openshift:master Sep 9, 2021
wking added a commit to wking/ci-docs that referenced this pull request Nov 3, 2021
@wking
Copy link
Member

wking commented Nov 3, 2021

I've filed openshift/ci-docs#196 to doc this new setting.

wking added a commit to wking/openshift-release that referenced this pull request Nov 18, 2021
The 4.9 to 4.10 to 4.9 rollbacks keep hitting the 3h timeout:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&name=4.10-upgrade-from-stable-4.9&search=Process+did+not+finish+before.*timeout' | grep 'rollback.*failures match'
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
  $ w3m -dump -cols 200 'https://search.ci.openshift.org/search?maxAge=96h&type=junit&context=0&name=4.10-upgrade-from-stable-4.9&search=Process+did+not+finish+before.*timeout' | jq -r 'to_entries[].value | to_entries[].value[] | .name + " " + .context[0]' | sed -n 's/\(.*rollback\) .*before \([^ ]*\) timeout.*/\1 \2/p'
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
  periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback 3h0m0s

We've had the 3h timeout since 90fadfe (steps/openshift-e2e-test:
e2e tests can take longer than 2h, 2021-01-06, openshift#14674).  We still need
some time for setup and teardown in the wrapping Prow job [1], so I'm
also setting timeout on the jobs, via [2].  That leaves us exposed to
situations where other jobs that use this same step hang up and spend
so long in the step, that a wrapping Plank/Prow timeout leaves them
with too little time to finish their teardown/gather.  But if that
happens, maybe the test-platform folks will give us either a way to
override a single step's timeout for a job, or a blanket increase in
the Plank/Prow cap.

[1]: https://docs.ci.openshift.org/docs/architecture/timeouts/
[2]: openshift/ci-tools#2294
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants