prowgen: support overriding prowjob timeout #2294
3e1ce7e to
849b869
Compare
Can you use |
In the end the timeout needs to be set, yes, but either:
Not sure which option is preferable, this PR implements option 1 |
I think this is fine if it has some sane enforced upper limit (I'd be conservative and go for 8h).
849b869 to
3bd7692
Compare
3bd7692 to
69ea12b
Compare
Limited job time to 8hrs |
petr-muller
left a comment
I think we'll also need a small doc followup for https://docs.ci.openshift.org/docs/architecture/timeouts/ (does not block merge)
    maxCustomDuration := time.Hour * 8
    if timeout != nil && timeout.Duration <= maxCustomDuration {
        decorationConfig.Timeout = timeout
    }
I think we should actively refuse this in ci-op config validation (https://github.com/openshift/ci-tools/blob/master/pkg/validation/test.go#L116) with some sensible error message instead of capping it silently.
good idea, implemented
Add "Timeout" param to the test, which would reset `decoration_config.timeout` so that some tests could run longer than 4 hours
69ea12b to
e2369d4
Compare
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: petr-muller, vrutkovs
Catching up with [1]. [1]: openshift/ci-tools#2294
I've filed openshift/ci-docs#196 to document this new setting.
The 4.9 to 4.10 to 4.9 rollbacks keep hitting the 3h timeout:

    $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&name=4.10-upgrade-from-stable-4.9&search=Process+did+not+finish+before.*timeout' | grep 'rollback.*failures match'
    periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
    periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
    $ w3m -dump -cols 200 'https://search.ci.openshift.org/search?maxAge=96h&type=junit&context=0&name=4.10-upgrade-from-stable-4.9&search=Process+did+not+finish+before.*timeout' | jq -r 'to_entries[].value | to_entries[].value[] | .name + " " + .context[0]' | sed -n 's/\(.*rollback\) .*before \([^ ]*\) timeout.*/\1 \2/p'
    periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
    periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
    periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback 3h0m0s
    periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback 3h0m0s

We've had the 3h timeout since 90fadfe (steps/openshift-e2e-test: e2e tests can take longer than 2h, 2021-01-06, openshift#14674). We still need some time for setup and teardown in the wrapping Prow job [1], so I'm also setting timeout on the jobs, via [2].

That leaves us exposed to situations where other jobs that use this same step hang and spend so long in the step that a wrapping Plank/Prow timeout leaves them with too little time to finish their teardown/gather. But if that happens, maybe the test-platform folks will give us either a way to override a single step's timeout for a job, or a blanket increase in the Plank/Prow cap.

[1]: https://docs.ci.openshift.org/docs/architecture/timeouts/
[2]: openshift/ci-tools#2294
Add "Timeout" param to the test, which would reset `decoration_config.timeout` so that some tests could run longer than 4 hours.

Example job: openshift/release#21355

TODO:
24h would be sufficient for now?

Ref: https://issues.redhat.com/browse/DPTP-889