ci-operator/jobs/openshift/release: 8h timeout for chained updates #19638
Conversation
Bumping the timeout will reduce the frequency of a few problem
classes (the per-job timeout knob is sketched just after this list):
* Jobs which take a while to acquire a Boskos lease, and then aren't left
with enough ProwJob time remaining to complete their test. For
example, [1]:
INFO[2021-06-16T13:55:36Z] ci-operator version v20210616-eab9ae5
...
INFO[2021-06-16T15:58:16Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-1--aws-quota-slice-05]
...
INFO[2021-06-16T17:55:07Z] Running step e2e-aws-fips-gather-extra.
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-16T17:55:36Z"}
[1] is an openshift/origin job, and I'm only bumping the timeout for
the chained-update periodics, but those periodics are exposed to
this same problem class.
* Jobs where the test is slow enough that it does not fit into the
ProwJob time, regardless of how quickly they acquire a Boskos
lease. For example, [2]:
INFO[2021-06-23T14:38:04Z] ci-operator version v20210623-b3b4289
...
INFO[2021-06-23T14:39:43Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-2--aws-quota-slice-03]
...
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-23T18:38:03Z"}
Downsides to raising the timeout are mostly increases to CI cost and
reduced CI capacity due to resource consumption. Possible mechanisms
include:
* Jobs that actually hang might consume resources for longer. These
can be mitigated with lower-level timeouts, e.g. on the steps [3]
(a per-step sketch follows this list).
* Jobs that were aborted without completing teardown. These can be
mitigated by improving workflows to make it more likely that the
teardown step gets completed when an earlier step times out [4].
These can also be mitigated by per-job knobs on leaked resource
reaping, like AWS's expirationDate, which we've set since
f64b8ea (cluster-launch-installer-e2e: Set expirationDate tags,
2018-07-26, openshift#1103), and which is used by
core-services/ipi-deprovision/aws.sh. In this commit, I'm raising
the expirationDate offset to 8h for step workflows, because those
have access to the step timeouts [3]. But I'm leaving it at 4h for
the templates, because those do not. Slow AWS jobs which continue
to use templates should migrate to steps.
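As a concrete illustration of the lower-level timeouts in [3], each step registry ref can carry its own `timeout` and `grace_period`. A minimal sketch of a ref definition; the step name, base image, resources, and durations are illustrative, not taken from this PR:

```yaml
ref:
  as: example-gather-extra                 # hypothetical step name
  from: cli                                # hypothetical base image
  commands: example-gather-extra-commands.sh
  resources:
    requests:
      cpu: 100m
      memory: 200Mi
  timeout: 1h0m0s       # bound this single step, well under the 8h ProwJob budget
  grace_period: 10m0s   # illustrative; extra time the step gets to exit after being signalled
  documentation: |-
    Illustrative step demonstrating per-step timeout and grace period.
```

Bounding individual steps is what keeps a genuinely hung step from holding a Boskos lease for the full eight hours.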
The GCP installer currently lacks the ability to create that sort of
resource tag, so I'm adjusting some hard-coded defaults in that file
to keep up with this commit.
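For context on the `expirationDate` mechanism referenced above: the value is computed at job runtime and injected as an AWS user tag, so a reaper such as `core-services/ipi-deprovision/aws.sh` can delete anything that has outlived its expiration. A hedged sketch of the install-config side; the region and timestamp are illustrative, and the exact format expected by the reaper is not shown here:

```yaml
# install-config.yaml fragment (illustrative values)
platform:
  aws:
    region: us-east-1
    userTags:
      # Stamped onto every resource the installer creates; a periodic reaper can
      # then deprovision clusters whose expirationDate lies in the past.
      expirationDate: "2021-06-23T22:38+00:00"
```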
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26235/pull-ci-openshift-origin-master-e2e-aws-fips/1405162123687890944
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1407709521396109312
[3]: https://docs.ci.openshift.org/docs/architecture/timeouts/#step-registry-test-process-timeouts
[4]: https://docs.ci.openshift.org/docs/architecture/timeouts/#how-interruptions-may-be-handled
@petr-muller pointed out the per-job syntax. I've opened openshift/ci-docs#162 to make that more discoverable, and pivoted this PR to just bump the chained-update jobs with a70151f7f8 -> 64d9389.
@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests. Full PR test history. Your PR dashboard.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: petr-muller, wking.
@wking: Updated the following 5 configmaps.
From the test-infra docs, `extra_refs` is used for matching periodics, and it does seem to be set on these periodics.
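For readers unfamiliar with the field, `extra_refs` is how a periodic declares the repositories it checks out (periodics have no implicit ref of their own), which is what tooling uses to associate a periodic with a repo. A minimal sketch; the branch and surrounding fields are illustrative:

```yaml
periodics:
- name: periodic-example-chained-update   # hypothetical, as in the earlier sketch
  cron: "0 0 * * *"
  extra_refs:
  - org: openshift
    repo: release
    base_ref: master                      # illustrative branch
```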