Skip to content

Conversation

@cgwalters
Copy link
Member

Today we spawn three clusters every time someone does a git push.
That's pretty nuts, and IMO doing this type of thing across
all the repos is both an unnecessary waste of money, and also
exacerbates issues with AWS resource limits.

I sometimes hesitate at doing git push to fix a typo in a comment
as part of a PR review because I know that push may end up
stealing an AWS NAT limit that's going to be used by an actually
important job.

For the MCO specifically many of our changes are extremely
unlikely to break e2e-aws, and if they did they'd break
e2e-aws-op too.

So we'll do e.g. /test e2e-aws on demand in PRs, or
/test all, etc.

Today we spawn *three* clusters every time someone does a `git push`.
That's pretty nuts, and IMO doing this type of thing across
all the repos is both an unnecessary waste of money, and also
exacerbates issues with AWS resource limits.

I sometimes hesitate at doing `git push` to fix a typo in a comment
as part of a PR review because I know that push may end up
stealing an AWS NAT limit that's going to be used by an actually
important job.

For the MCO specifically many of our changes are extremely
unlikely to break `e2e-aws`, and if they did they'd break
`e2e-aws-op` too.

So we'll do e.g. `/test e2e-aws` on demand in PRs, or
`/test all`, etc.
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2019
@cgwalters
Copy link
Member Author

/hold
Until MCO team approves

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 30, 2019
@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 30, 2019
@wking
Copy link
Member

wking commented Apr 30, 2019

More detail from @cgwalters for folks like me who are shaky on these job properties:

There's a difference in Prow between always_run and optional. The jobs are still not optional, i.e. they are required for merges. This is not at all turning off tests, it's just making them not run on every git push.

@cgwalters
Copy link
Member Author

cgwalters commented Apr 30, 2019

We're debating whether or not this actually works. The intended semantics are still to have all the jobs run after slash-lgtm.

My reading of the Prow docs implied it was, but I could be wrong! (Do we have a "test repo" setup where we can play with job configs?)

Or with Prow and this type of config today do we need to do

 / test all
 / lgtm

when setting up to merge?

@openshift-ci-robot
Copy link
Contributor

@cgwalters: you cannot LGTM your own PR.

Details

In response to this:

We're debating whether or not this actually works. The intended semantics are still to have all the jobs run after /lgtm.

My reading of the Prow docs implied it was, but I could be wrong! (Do we have a "test repo" setup where we can play with job configs?)

Or with Prow and this type of config today do we need to do

/test all
/lgtm

when setting up to merge?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Copy link
Member Author

I'm doing this for a whole lot of reasons; another example is that right now at the moment I'd like to iterate on
openshift/machine-config-operator#682
and it's launching upgrade and regular e2e-aws jobs too which is a waste.

@sdodson
Copy link
Member

sdodson commented Apr 30, 2019

/approve
👍 to anything that reduces wasteful cluster builds. We seem to be pushing as many concurrent cluster builds as possible and then wondering why things fail.

As an aside, we should also be looking into measuring the cloud API requirements for our cluster tests so that we can determine what the maximum number of jobs we can run concurrently while ensuring a high rate of success.

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@patrickdillon
Copy link
Contributor

patrickdillon commented Apr 30, 2019

Interesting idea and seems possible. I think you need to create a trigger based on /lgtm: https://github.com/kubernetes/test-infra/blob/master/prow/jobs.md#triggering-jobs-with-comments
You might also need required: true rather than optional: false.

@runcom
Copy link
Member

runcom commented Apr 30, 2019

/hold
Until MCO team approves

it looks great to me, assuming we clear out confusion on how triggers work

@cgwalters
Copy link
Member Author

You might also need required: true

I don't see such a key in the Prow docs nor in any of our jobs.

@kikisdeliveryservice
Copy link
Contributor

I'm super on board with not running e2e on each and every push before my PR is ready for review and would love for these tests to be on-demand until I'm looking for a final approval.

@patrickdillon
Copy link
Contributor

I don't see such a key in the Prow docs nor in any of our jobs.
My mistake. Sorry for the confusion. Just reread the docs and you have it right with optional.

@smarterclayton
Copy link
Contributor

I don't know why you are worried about this. You aren't the problem, the infra is the problem. Our spend is a tiny fraction of the cost of business. Not a significant worry.

@smarterclayton
Copy link
Contributor

/hold

@stevekuznetsov
Copy link
Contributor

@stevekuznetsov
Copy link
Contributor

Some weird side effects of what you are doing:

  • if a job does not run for some reason, you will not be able to use /retest to trigger it, as Prow does not know it should be there
  • tide will only require the statuses if they exist, so you will be able to merge without ever triggering these manually, but once you trigger them manually if they're failed we will not merge

I don't think this is what you want to do.

@smarterclayton
Copy link
Contributor

A cluster run costs us between $0.25 and $0.50

@stevekuznetsov
Copy link
Contributor

If we get some idea of how many resources a job uses, we can have a throttle on the total number of concurrent jobs hitting the AWS API in a given zone.

@smarterclayton
Copy link
Contributor

The current priority is

  1. have AWS increase rate limits (real solution)
  2. potentially drop to 2 zones in CI (distant second solution)
  3. optimize image pulls to come from AWS (where the actual money goes)

@wking
Copy link
Member

wking commented May 1, 2019

  1. potentially drop to 2 zones in CI (distant second solution)

But also something we can do ourselves without waiting on AWS; #3615.

@sdodson
Copy link
Member

sdodson commented May 1, 2019

If we get some idea of how many resources a job uses

@stevekuznetsov which team would be best equipped to measure that?

@stevekuznetsov
Copy link
Contributor

@sdodson it's work that DPTP may take on for 4.2

@openshift-ci-robot
Copy link
Contributor

@cgwalters: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/build-farm/build01-dry 6afea5b link /test build01-dry

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@petr-muller
Copy link
Member

/close

Looks stale, please reopen & rebase if still needed.

@openshift-ci-robot
Copy link
Contributor

@petr-muller: Closed this PR.

Details

In response to this:

/close

Looks stale, please reopen & rebase if still needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants