
Conversation


@wking wking commented Jun 23, 2021

From the test-infra docs, extra_refs is used for matching periodics, and:

$ git grep -A4 extra_refs ci-operator/jobs/openshift/release | head -n5
ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml:  extra_refs:
ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml-  - base_ref: master
ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml-    org: openshift
ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml-    repo: release
ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml-  interval: 24h

seems to be set on these periodics. Bumping the timeout will reduce the frequency of a few problem classes:

  • Jobs which take a while to acquire a Boskos lease, and then aren't left with enough ProwJob time remaining to complete their test. For example, this job:

      INFO[2021-06-16T13:55:36Z] ci-operator version v20210616-eab9ae5
      ...
      INFO[2021-06-16T15:58:16Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-1--aws-quota-slice-05]
      ...
      INFO[2021-06-16T17:55:07Z] Running step e2e-aws-fips-gather-extra.
      {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-16T17:55:36Z"}
    

    that job is an openshift/origin job, and I'm only bumping the timeout for openshift/release, but openshift/release is exposed to this same problem class.

  • Jobs where the test is slow enough that it does not fit into the ProwJob time, regardless of how quickly they acquire a Boskos lease. For example, this job:

      INFO[2021-06-23T14:38:04Z] ci-operator version v20210623-b3b4289
      ...
      INFO[2021-06-23T14:39:43Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-2--aws-quota-slice-03]
      ...
      {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-23T18:38:03Z"}
    

Downsides to raising the timeout are mostly increases to CI cost and reduced CI capacity due to resource consumption. Possible mechanisms include:

  • Jobs that actually hang might consume resources for longer. These can be mitigated with lower-level timeouts, e.g. on the steps.

  • Jobs that were aborted without completing teardown. These can be mitigated by improving workflows to make it more likely that the teardown step gets completed when an earlier step times out. These can also be mitigated by per-job knobs on leaked resource reaping, like AWS's expirationDate, which we've set since f64b8ea (cluster-launch-installer-e2e: Set expirationDate tags #1103), and which is used by core-services/ipi-deprovision/aws.sh. In this commit, I'm raising the expirationDate offset to 8h for step workflows, because those have access to the step timeouts. But I'm leaving it at 4h for the templates, because those do not. Slow AWS jobs which continue to use templates should migrate to steps. A sketch of the expirationDate tag appears after this list.

    The GCP installer currently lacks the ability to create that sort of resource tag, so I'm adjusting some hard-coded defaults in that file to keep up with this commit.
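
For illustration, here is a minimal sketch of where an 8h expirationDate tag could land in an AWS install-config. This is a hypothetical fragment, not the literal output of ipi-conf-aws-commands.sh, and the region and timestamp are placeholders (the real script would compute roughly "now + 8h"):

    # Hypothetical install-config fragment; userTags is the installer's generic
    # mechanism for attaching extra tags to AWS resources it creates.
    platform:
      aws:
        region: us-east-1                        # placeholder region
        userTags:
          expirationDate: 2021-06-24T08:00+0000  # placeholder value for "now + 8h"

core-services/ipi-deprovision/aws.sh can then reap any cluster whose expirationDate tag is already in the past, which caps how long resources leaked by an aborted job can linger.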

Bumping the timeout will reduce the frequency of a few problem
classes:

* Jobs which take a while to acquire a Boskos lease, and then aren't left
  with enough ProwJob time remaining to complete their test.  For
  example, [1]:

    INFO[2021-06-16T13:55:36Z] ci-operator version v20210616-eab9ae5
    ...
    INFO[2021-06-16T15:58:16Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-1--aws-quota-slice-05]
    ...
    INFO[2021-06-16T17:55:07Z] Running step e2e-aws-fips-gather-extra.
    {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-16T17:55:36Z"}

  [1] is an openshift/origin job, and I'm only bumping the timeout for
  the chained-update periodics, but those periodics are exposed to
  this same problem class.

* Jobs where the test is slow enough that it does not fit into the
  ProwJob time, regardless of how quickly they acquire a Boskos
  lease.  For example, [2]:

    INFO[2021-06-23T14:38:04Z] ci-operator version v20210623-b3b4289
    ...
    INFO[2021-06-23T14:39:43Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-2--aws-quota-slice-03]
    ...
    {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-23T18:38:03Z"}

Downsides to raising the timeout are mostly increases to CI cost and
reduced CI capacity due to resource consumption.  Possible mechanisms
include:

* Jobs that actually hang might consume resources for longer.  These
  can be mitigated with lower-level timeouts, e.g. on the steps [3]; a
  sketch of a step-level timeout follows the footnotes below.

* Jobs that were aborted without completing teardown.  These can be
  mitigated by improving workflows to make it more likely that the
  teardown step gets completed when an earlier step times out [4].
  These can also be mitigated by per-job knobs on leaked resource
  reaping, like AWS's expirationDate, which we've set since
  f64b8ea (cluster-launch-installer-e2e: Set expirationDate tags,
  2018-07-26, openshift#1103), and which is used by
  core-services/ipi-deprovision/aws.sh.  In this commit, I'm raising
  the expirationDate offset to 8h for step workflows, because those
  have access to the step timeouts [3].  But I'm leaving it at 4h for
  the templates, because those do not.  Slow AWS jobs which continue
  to use templates should migrate to steps.

  The GCP installer currently lacks the ability to create that sort of
  resource tag, so I'm adjusting some hard-coded defaults in that file
  to keep up with this commit.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26235/pull-ci-openshift-origin-master-e2e-aws-fips/1405162123687890944
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1407709521396109312
[3]: https://docs.ci.openshift.org/docs/architecture/timeouts/#step-registry-test-process-timeouts
[4]: https://docs.ci.openshift.org/docs/architecture/timeouts/#how-interruptions-may-be-handled
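
For illustration, a minimal sketch of the step-level timeouts mentioned in [3]. The timeout and grace_period fields follow the step-registry schema, but the step name, image, and command file here are hypothetical placeholders, not real registry entries:

    ref:
      as: example-upgrade-test                    # hypothetical step name
      from: cli                                   # placeholder image
      commands: example-upgrade-test-commands.sh  # placeholder script
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
      timeout: 7h0m0s      # cap this step well under the 8h ProwJob timeout
      grace_period: 10m0s  # time the process gets to exit cleanly once the timeout fires

With step-level caps like this in place, a workflow keeps headroom for its post (teardown) steps, which is the kind of handling [4] describes for interrupted jobs.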
@wking wking force-pushed the bump-prow-job-timeout branch from a70151f to 64d9389 on June 24, 2021 at 00:05
@wking wking changed the title from "core-services/prow/02_config: Bump openshift/release timeout to 8h" to "ci-operator/jobs/openshift/release: 8h timeout for chained updates" on Jun 24, 2021

wking commented Jun 24, 2021

@petr-muller pointed out the per-job syntax. I've opened openshift/ci-docs#162 to make that more discoverable, and pivoted this PR to just bump the chained-update jobs with a70151f7f8 -> 64d9389.
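
For reference, a minimal sketch of the sort of per-job syntax involved, assuming Prow's decoration_config with its timeout field; the periodic below is a hypothetical placeholder, not one of the actual chained-update jobs touched in 64d9389:

    periodics:
    - name: example-chained-update-periodic   # hypothetical job name
      interval: 24h
      extra_refs:
      - org: openshift
        repo: release
        base_ref: master
      decoration_config:
        timeout: 8h0m0s                       # per-job override of the default 4h ProwJob timeout

Scoping the override per job keeps the longer limit on the slow chained-update runs instead of raising it for every openshift/release job.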


openshift-ci bot commented Jun 24, 2021

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node 64d9389 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.1/e2e-aws-image-ecosystem 64d9389 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.1/e2e-aws-builds 64d9389 link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-csi-migration 64d9389 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.9/e2e-aws 64d9389 link /test pj-rehearse
ci/rehearse/redhat-developer/jenkins-operator/main/e2e 64d9389 link /test pj-rehearse
ci/rehearse/openshift/cluster-logging-operator/tech-preview/e2e-operator 64d9389 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.2/e2e-cmd 64d9389 link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-workers-rhel7 64d9389 link /test pj-rehearse
ci/rehearse/openshift/cloud-credential-operator/release-4.9/e2e-aws-manual-oidc 64d9389 link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-calico 64d9389 link /test pj-rehearse
ci/rehearse/openshift/ovn-kubernetes/release-4.9/e2e-ovn-hybrid-step-registry 64d9389 link /test pj-rehearse
ci/rehearse/openshift/installer/release-4.9/e2e-aws-upgrade 64d9389 link /test pj-rehearse
ci/rehearse/openshift/cluster-cloud-controller-manager-operator/release-4.9/e2e-aws-ccm 64d9389 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.9/e2e-aws-disruptive 64d9389 link /test pj-rehearse
ci/prow/pj-rehearse 64d9389 link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node 64d9389 link /test pj-rehearse


@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Jun 24, 2021

openshift-ci bot commented Jun 24, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jun 24, 2021
@openshift-merge-robot openshift-merge-robot merged commit df6bff4 into openshift:master Jun 24, 2021

openshift-ci bot commented Jun 24, 2021

@wking: Updated the following 5 configmaps:

  • job-config-4.6 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-release-4.6-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-release-4.6-periodics.yaml
  • job-config-4.7 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-release-4.7-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-release-4.7-periodics.yaml
  • job-config-4.8 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-release-4.8-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-release-4.8-periodics.yaml
  • job-config-4.9 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-release-4.9-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-release-4.9-periodics.yaml
  • step-registry configmap in namespace ci at cluster app.ci using the following files:
    • key ipi-conf-aws-commands.sh using file ci-operator/step-registry/ipi/conf/aws/ipi-conf-aws-commands.sh

@wking wking deleted the bump-prow-job-timeout branch September 21, 2021 02:30