Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Oct 20, 2020

Before this commit, pj-rehearse would just give up:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/12959/pull-ci-openshift-release-master-pj-rehearse/1318651994902106112/build-log.txt | tail -n2
time="2020-10-20T20:36:51Z" level=info msg="Created a rehearsal job to be submitted" org=openshift rehearsal-job=rehearse-12959-pull-ci-openshift-kubernetes-master-e2e-cmd repo=release target-job=pull-ci-openshift-kubernetes-master-e2e-cmd target-repo=openshift/kubernetes
time="2020-10-20T20:36:51Z" level=info msg="Would rehearse too many jobs, will not proceed" org=openshift rehearsal-jobs=68 rehearsal-threshold=45 repo=release

But skipping rehearsals for a change that touches tons of jobs, like we have done since ef4f47e (openshift/ci-operator-prowgen#78), makes it possible to break a whole lot of things without failing a warning rehearsal. With this commit, we pivot from "give up and test nothing" to "test as many of the touched jobs as we can afford".

It would be nice to intelligently truncate, e.g. if we touch a few types of jobs, or a few different workflows. But grouping seems complicated, so for this commit I'm just randomly shuffling and then dumping the tail.

CC @petr-muller

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 20, 2020
Before this commit, pj-rehearse would just give up [1]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/12959/pull-ci-openshift-release-master-pj-rehearse/1318651994902106112/build-log.txt | tail -n2
  time="2020-10-20T20:36:51Z" level=info msg="Created a rehearsal job to be submitted" org=openshift rehearsal-job=rehearse-12959-pull-ci-openshift-kubernetes-master-e2e-cmd repo=release target-job=pull-ci-openshift-kubernetes-master-e2e-cmd target-repo=openshift/kubernetes
  time="2020-10-20T20:36:51Z" level=info msg="Would rehearse too many jobs, will not proceed" org=openshift rehearsal-jobs=68 rehearsal-threshold=45 repo=release

But skipping rehearsals for a change that touches tons of jobs, like
we have done since ef4f47e (Limit rehearsals to 15 jobs,
2019-02-13, openshift/ci-operator-prowgen#78), makes it possible to
break a whole lot of things without failing a warning rehearsal.  With
this commit, we pivot from "give up and test nothing" to "test as many
of the touched jobs as we can afford".

It would be nice to intelligently truncate, e.g. if we touch a few
types of jobs, or a few different workflows.  But grouping seems
complicated, so for this commit I'm just randomly shuffling and then
dumping the tail.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/12959/pull-ci-openshift-release-master-pj-rehearse/1318651994902106112
@wking wking force-pushed the allow-some-rehearsals-for-changes-that-would-create-too-many branch from 50505f9 to f3352ae Compare October 20, 2020 22:56
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2020
@wking
Copy link
Member Author

wking commented Oct 20, 2020

I've pushed 50505f9 -> f3352ae to fix:

cmd/pj-rehearse/main.go:342:49: undefined: limit 

wking added a commit to wking/openshift-release that referenced this pull request Oct 20, 2020
Folks have been tuning this since b0d60e3 (Bump the limit of
rehearsed jobs, 2019-08-19, openshift#4789), most recently in 3f0040e (Raise
rehearsal-limit to 45 temporarily, 2020-02-03, openshift#6999).  But with [1],
we no longer need to fuss with this setting in order to see rehearsals
for changes that touch lots of jobs, so let it fall back to
pj-rehearse's default of 15.

[1]: openshift/ci-tools#1315
@wking
Copy link
Member Author

wking commented Oct 21, 2020

e2e:

error: some steps failed:
  * could not run steps: step [release:custom] failed: the following tags from the release could not be imported to stable-custom after five minutes:
- aws-machine-controllers from registry.svc.ci.openshift.org/origin/4.3-2020-10-19-215406@sha256:367bad806b883bb2531a3c05e9a3782a76d0be3abccf5b47b9d5724da59c3bc9
...

But that seems orthogonal. I think this is ready to land, just needs a new lgtm since my typo fix.

@alvaroaleman
Copy link
Contributor

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, stevekuznetsov, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [alvaroaleman,stevekuznetsov]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 57ad16e into openshift:master Oct 21, 2020
@wking wking deleted the allow-some-rehearsals-for-changes-that-would-create-too-many branch October 21, 2020 03:54
@petr-muller
Copy link
Member

I'm not a big fan of this, as this will result in gargantuan bot comments with old stale rehearsal failures - this is one of the most visible annoyances and source of confusion of users, and this PR will make that much worse.

@alvaroaleman
Copy link
Contributor

@petr-muller what about instead sorting them and using the first N?

@wking
Copy link
Member Author

wking commented Oct 22, 2020

Or pruning stale failures? Since that is an issue, even without this feeding in its churn.

@alvaroaleman
Copy link
Contributor

alvaroaleman commented Oct 22, 2020

Or pruning stale failures? Since that is an issue, even without this feeding in its churn.

The standard pruning logic is in prow and not easely changeable because that affects a lot of ppl and I am not fond of the idea of adding a github client and a custom pruning logic to the rehearsal tool

@petr-muller
Copy link
Member

Sorting helps but not entirely because the underlying set changes between runs, too

Pruning stale failures... I understand that is a wanted feature of Prow itself, to capture that some job ran (possibly with manual trigger only) in the past revision. With rehearsals it's different and annoying, but it would be quite a lot of work, b/c rehearsal tool does not currently interact with GH API in any way.

@petr-muller
Copy link
Member

Bloody hell Alvaro beat me :)

@wking
Copy link
Member Author

wking commented Oct 22, 2020

Pruning stale failures... I understand that is a wanted feature of Prow itself, to capture that some job ran (possibly with manual trigger only) in the past revision.

Looks like the The following tests failed, say /retest to rerun all failed tests: summary comments include a Commit column today. Maybe we can use that to exclude, or at least more obviously call out, jobs which had run on older commits?

@alvaroaleman
Copy link
Contributor

Maybe we can use that to exclude, or at least more obviously call out, jobs which had run on older commits?

I personally would like that as well (ref kubernetes/community#3621 (comment)) but:

  • That discussion is moving really really slow and we need buy in from others
  • It means we will need up to one more token per reporting (because we need to find out the current HEAD sha)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants