
Conversation

@wking (Member) commented Feb 8, 2021

Make it easier to reproduce in CI the issues that show up in the CI clusters. Those clusters are mostly full of CI jobs with moderate CPU load and PodDisruptionBudgets that protect them from being evicted. They run for up to 4 hours before being terminated, and have a 30-minute termination grace period on top of that. We obviously can't use a workload that slow to drain in a CI job, or our CI job would overshoot its time limit and be killed. In this commit, I'm adding a new step (linked up just to the AWS update workflow for now) to install a deployment that asks for 100m of CPU but then (I think) consumes as much CPU as is available. It would be awesome if there were a test widget in some shipped container (like the tools image) that could be configured to consume a particular amount of CPU and memory, although I guess it would be hard to parameterize "regular" memory access. Anyhow, this is a first-pass WIP to feel out this general approach.

The manifest will subsequently be picked up and fed to the installer in the ipi-install-install step, or one of its close relatives. We'd be fine installing this as a day-2 manifest as well, but we don't have tooling in place for that yet, and installing it via the installer gives it more time to roll out into the compute nodes before the test step rolls around.
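For illustration only, here is a minimal sketch of what such a step script might look like, assuming the usual convention of dropping extra manifests into ${SHARED_DIR} for the install steps to pick up. The file name, namespace, image, replica count, and busy-loop command below are placeholders, not the actual step from this PR:

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical step script (names and values are illustrative, not this PR's
# actual step).  Write a Deployment manifest into SHARED_DIR so that
# ipi-install-install (or a close relative) feeds it to the installer at
# install time.  Each replica requests only 100m of CPU but busy-loops with
# no CPU limit, so it soaks up whatever CPU the node has to spare.
cat > "${SHARED_DIR}/manifest_cpu-load.yaml" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-load
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cpu-load
  template:
    metadata:
      labels:
        app: cpu-load
    spec:
      containers:
      - name: burn
        image: registry.access.redhat.com/ubi8/ubi-minimal
        # Busy-loop forever; with no CPU limit set, this consumes whatever CPU
        # the node has left beyond the 100m request.
        command: ["/bin/sh", "-c", "while true; do :; done"]
        resources:
          requests:
            cpu: 100m
EOF
```

Pairing this with a PodDisruptionBudget over the same pods would also make the workload slow to drain, like the CI jobs on the build clusters, but that is exactly the part that has to stay within the CI job's timeout budget.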

@openshift-ci-robot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) on Feb 8, 2021
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 8, 2021
@wking force-pushed the load-compute-during-updates branch from ab8e77a to 797ae8b on February 9, 2021 00:16
@abhinavdahiya (Contributor)

We'd be fine installing this as a day-2 manifest as well, but we don't have tooling in place for that yet

what kind of tooling is missing to apply manifests day 2?

The steps have access to oc via the cli image, and to the cluster's KUBECONFIG via shared env variables.

oc apply -f some-dir/

would be good enough, right?
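Along those lines, a day-2 variant could be a very small step script. This is just a sketch, assuming the workflow already exports KUBECONFIG, the step runs on the cli image, and reusing the hypothetical manifest and deployment names from the sketch above:

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical day-2 step: oc comes from the cli image and KUBECONFIG already
# points at the installed cluster, so applying the manifest is a one-liner.
oc apply -f "${SHARED_DIR}/manifest_cpu-load.yaml"

# Optionally wait for the load deployment to land on the compute nodes before
# the update/test step starts, instead of relying on install-time lead time.
oc -n default rollout status deployment/cpu-load --timeout=10m
```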

@wking force-pushed the load-compute-during-updates branch from 797ae8b to e2f721f on February 9, 2021 05:33
@wking force-pushed the load-compute-during-updates branch from e2f721f to 83a3c4d on February 9, 2021 05:35
@wking (Member, Author) commented Feb 9, 2021

what kind of tooling is missing to apply manifests day 2?

Nothing complicated, but my last run at this got reverted and I'm not sure why. Discussion in #10039 and #10053.

@openshift-ci bot (Contributor) commented Feb 9, 2021

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
ci/rehearse/openshift/cloud-credential-operator/master/e2e-upgrade | 83a3c4d | link | /test pj-rehearse
ci/prow/pj-rehearse | 83a3c4d | link | /test pj-rehearse

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking (Member, Author) commented Feb 16, 2021

Perf folks are going to handle CI for this use-case, and I don't have time to figure out why my approach isn't working ;)

/close

@openshift-ci-robot (Contributor)

@wking: Closed this PR.


In response to this:

Perf folks are going to handle CI for this use-case, and I don't have time to figure out why my approach isn't working ;)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking deleted the load-compute-during-updates branch on February 16, 2021 22:40