
Conversation


@wking wking commented Jun 23, 2021

From the test-infra docs, extra_refs is used for matching periodics, and:

$ git grep -A4 extra_refs ci-operator/jobs/openshift/release | head -n5
ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml:  extra_refs:
ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml-  - base_ref: master
ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml-    org: openshift
ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml-    repo: release
ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml-  interval: 24h

seems to be set on these periodics. Bumping the timeout will reduce the frequency of a few problem classes:

  • Jobs which take a while to acquire a Boskos lease, and then aren't left with enough ProwJob time remaining to complete their test. For example, this job:

      INFO[2021-06-16T13:55:36Z] ci-operator version v20210616-eab9ae5
      ...
      INFO[2021-06-16T15:58:16Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-1--aws-quota-slice-05]
      ...
      INFO[2021-06-16T17:55:07Z] Running step e2e-aws-fips-gather-extra.
      {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-16T17:55:36Z"}
    

    that job is an openshift/origin job, and I'm only bumping the timeout for openshift/release, but openshift/release is exposed to this same problem class.

  • Jobs where the test is slow enough that it does not fit into the ProwJob time, regardless of how quickly they acquire a Boskos lease. For example, this job:

      INFO[2021-06-23T14:38:04Z] ci-operator version v20210623-b3b4289
      ...
      INFO[2021-06-23T14:39:43Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-2--aws-quota-slice-03]
      ...
      {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-23T18:38:03Z"}
    

Downsides to raising the timeout are mostly increases to CI cost and reduced CI capacity due to resource consumption. Possible mechanisms include:

  • Jobs that actually hang might consume resources for longer. These can be mitigated with lower-level timeouts, e.g. on the steps.

  • Jobs that were aborted without completing teardown. These can be mitigated by improving workflows to make it more likely that the teardown step gets completed when an earlier step times out. These can also be mitigated by per-job knobs on leaked resource reaping, like AWS's expirationDate, which we've set since f64b8ea (cluster-launch-installer-e2e: Set expirationDate tags #1103), and which is used by core-services/ipi-deprovision/aws.sh. In this commit, I'm raising the expirationDate offset to 8h for step workflows, because those have access to the step timeouts. But I'm leaving it at 4h for the templates, because those do not. Slow AWS jobs which continue to use templates should migrate to steps. A sketch of the expirationDate tag appears after this list.

    The GCP installer currently lacks the ability to create that sort of resource tag, so I'm adjusting some hard-coded defaults in that file to keep up with this commit.
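
For illustration, here is a minimal sketch of where an 8h expirationDate tag could land in an AWS install-config. This is a hypothetical fragment, not the literal output of ipi-conf-aws-commands.sh, and the region and timestamp are placeholders (the real script would compute roughly "now + 8h"):

    # Hypothetical install-config fragment; userTags is the installer's generic
    # mechanism for attaching extra tags to AWS resources it creates.
    platform:
      aws:
        region: us-east-1                        # placeholder region
        userTags:
          expirationDate: 2021-06-24T08:00+0000  # placeholder value for "now + 8h"

core-services/ipi-deprovision/aws.sh can then reap any cluster whose expirationDate tag is already in the past, which caps how long resources leaked by an aborted job can linger.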

Bumping the timeout will reduce the frequency of a few problem
classes:

* Jobs which take a while to acquire a Boskos lease, and then aren't left
  with enough ProwJob time remaining to complete their test.  For
  example, [1]:

    INFO[2021-06-16T13:55:36Z] ci-operator version v20210616-eab9ae5
    ...
    INFO[2021-06-16T15:58:16Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-1--aws-quota-slice-05]
    ...
    INFO[2021-06-16T17:55:07Z] Running step e2e-aws-fips-gather-extra.
    {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-16T17:55:36Z"}

  [1] is an openshift/origin job, and I'm only bumping the timeout for
  the chained-update periodics, but those periodics are exposed to
  this same problem class.

* Jobs where the test is slow enough that it does not fit into the
  ProwJob time, regardless of how quickly they acquire a Boskos
  lease.  For example, [2]:

    INFO[2021-06-23T14:38:04Z] ci-operator version v20210623-b3b4289
    ...
    INFO[2021-06-23T14:39:43Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-2--aws-quota-slice-03]
    ...
    {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-23T18:38:03Z"}

Downsides to raising the timeout are mostly increases to CI cost and
reduced CI capacity due to resource consumption.  Possible mechanisms
include:

* Jobs that actually hang might consume resources for longer.  These
  can be mitigated with lower-level timeouts, e.g. on the steps [3]; a
  sketch of a step-level timeout follows the footnotes below.

* Jobs that were aborted without completing teardown.  These can be
  mitigated by improving workflows to make it more likely that the
  teardown step gets completed when an earlier step times out [4].
  These can also be mitigated by per-job knobs on leaked resource
  reaping, like AWS's expirationDate, which we've set since
  f64b8ea (cluster-launch-installer-e2e: Set expirationDate tags,
  2018-07-26, openshift#1103), and which is used by
  core-services/ipi-deprovision/aws.sh.  In this commit, I'm raising
  the expirationDate offset to 8h for step workflows, because those
  have access to the step timeouts [3].  But I'm leaving it at 4h for
  the templates, because those do not.  Slow AWS jobs which continue
  to use templates should migrate to steps.

  The GCP installer currently lacks the ability to create that sort of
  resource tag, so I'm adjusting some hard-coded defaults in that file
  to keep up with this commit.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26235/pull-ci-openshift-origin-master-e2e-aws-fips/1405162123687890944
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1407709521396109312
[3]: https://docs.ci.openshift.org/docs/architecture/timeouts/#step-registry-test-process-timeouts
[4]: https://docs.ci.openshift.org/docs/architecture/timeouts/#how-interruptions-may-be-handled
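
For illustration, a minimal sketch of the step-level timeouts mentioned in [3]. The timeout and grace_period fields follow the step-registry schema, but the step name, image, and command file here are hypothetical placeholders, not real registry entries:

    ref:
      as: example-upgrade-test                    # hypothetical step name
      from: cli                                   # placeholder image
      commands: example-upgrade-test-commands.sh  # placeholder script
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
      timeout: 7h0m0s      # cap this step well under the 8h ProwJob timeout
      grace_period: 10m0s  # time the process gets to exit cleanly once the timeout fires

With step-level caps like this in place, a workflow keeps headroom for its post (teardown) steps, which is the kind of handling [4] describes for interrupted jobs.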
@wking wking force-pushed the bump-prow-job-timeout branch from a70151f to 64d9389 on June 24, 2021 at 00:05
@wking wking changed the title from "core-services/prow/02_config: Bump openshift/release timeout to 8h" to "ci-operator/jobs/openshift/release: 8h timeout for chained updates" on Jun 24, 2021

wking commented Jun 24, 2021

@petr-muller pointed out the per-job syntax. I've opened openshift/ci-docs#162 to make that more discoverable, and pivoted this PR to just bump the chained-update jobs with a70151f7f8 -> 64d9389.
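
For reference, a minimal sketch of the sort of per-job syntax involved, assuming Prow's decoration_config with its timeout field; the periodic below is a hypothetical placeholder, not one of the actual chained-update jobs touched in 64d9389:

    periodics:
    - name: example-chained-update-periodic   # hypothetical job name
      interval: 24h
      extra_refs:
      - org: openshift
        repo: release
        base_ref: master
      decoration_config:
        timeout: 8h0m0s                       # per-job override of the default 4h ProwJob timeout

Scoping the override per job keeps the longer limit on the slow chained-update runs instead of raising it for every openshift/release job.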


openshift-ci bot commented Jun 24, 2021

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node 64d9389 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.1/e2e-aws-image-ecosystem 64d9389 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.1/e2e-aws-builds 64d9389 link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-csi-migration 64d9389 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.9/e2e-aws 64d9389 link /test pj-rehearse
ci/rehearse/redhat-developer/jenkins-operator/main/e2e 64d9389 link /test pj-rehearse
ci/rehearse/openshift/cluster-logging-operator/tech-preview/e2e-operator 64d9389 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.2/e2e-cmd 64d9389 link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-workers-rhel7 64d9389 link /test pj-rehearse
ci/rehearse/openshift/cloud-credential-operator/release-4.9/e2e-aws-manual-oidc 64d9389 link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-calico 64d9389 link /test pj-rehearse
ci/rehearse/openshift/ovn-kubernetes/release-4.9/e2e-ovn-hybrid-step-registry 64d9389 link /test pj-rehearse
ci/rehearse/openshift/installer/release-4.9/e2e-aws-upgrade 64d9389 link /test pj-rehearse
ci/rehearse/openshift/cluster-cloud-controller-manager-operator/release-4.9/e2e-aws-ccm 64d9389 link /test pj-rehearse
ci/rehearse/openshift/origin/release-4.9/e2e-aws-disruptive 64d9389 link /test pj-rehearse
ci/prow/pj-rehearse 64d9389 link /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node 64d9389 link /test pj-rehearse


@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Jun 24, 2021

openshift-ci bot commented Jun 24, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jun 24, 2021
@openshift-merge-robot openshift-merge-robot merged commit df6bff4 into openshift:master Jun 24, 2021

openshift-ci bot commented Jun 24, 2021

@wking: Updated the following 5 configmaps:

  • job-config-4.6 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-release-4.6-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-release-4.6-periodics.yaml
  • job-config-4.7 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-release-4.7-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-release-4.7-periodics.yaml
  • job-config-4.8 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-release-4.8-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-release-4.8-periodics.yaml
  • job-config-4.9 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-release-4.9-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-release-4.9-periodics.yaml
  • step-registry configmap in namespace ci at cluster app.ci using the following files:
    • key ipi-conf-aws-commands.sh using file ci-operator/step-registry/ipi/conf/aws/ipi-conf-aws-commands.sh

@wking wking deleted the bump-prow-job-timeout branch September 21, 2021 02:30