[WIP] search and retire stalled job statuses #131

droslean · 2019-04-08T14:45:24Z

After merging #122 we created the possibility of generating jobs that do not currently affect the current state of the PR. On retest or new commits, we want to make sure to retire all jobs that are not related to the present changes.

WIP -> working on tests.

test PR in release repo -> #3375

openshift-ci-robot · 2019-04-08T14:45:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: droslean

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [droslean]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

droslean · 2019-04-08T14:45:52Z

/cc @openshift/developer-productivity-test-platform

stevekuznetsov · 2019-04-09T19:38:16Z

I do not think that the rehearsal tool should be communicating with GitHub.

droslean · 2019-04-10T11:49:19Z

@openshift/developer-productivity-test-platform
Another way should be to change the way how to configure the retest command for each job.
Do you think that we can add a retest command to each job depending on their commit SHA? Are those jobs will rerun in retest commend or test all?

petr-muller · 2019-04-10T13:04:46Z

I do not think that the rehearsal tool should be communicating with GitHub.

@stevekuznetsov Do you see any viable alternative to achieve the removal of stale rehearsal statuses? IIRC we were talking about using migratestatus code to remove statuses from the beginning. What's your concern, just general dislike of making rehearsals interact with GH directly, excessive API usage, something else?

stevekuznetsov · 2019-04-11T18:20:30Z

I do not want to:

provide users a DDOS mechanism for our GitHub tokens through /retest
break encapsulation of GitHub interaction in Prow

Also without some data on how often this happens and how bad the issue is I'm inclined to just leave this unfixed. If we deterministically find things to rehearse we don't have this problem unless someone changes the set of jobs they're changing in a PR, which should be fairly rare.

petr-muller · 2019-04-12T10:02:47Z

I do not want to:
* provide users a DDOS mechanism for our GitHub tokens through /retest

Fair point.

* break encapsulation of GitHub interaction in Prow

Isn't this basically the same encapsulation break like status reconciler or migratestatus tools? Both externally handle cases which plank doesn't.

Also without some data on how often this happens

I think I can get you such data - I can add it the the underlying JIRA.

and how bad the issue is I'm inclined to just leave this unfixed. If we deterministically find things to rehearse

That's the point - in some cases rehearse is not (intentionally!) deterministic - the code that selects jobs to rehearse when a template fails selects one of the jobs that use the template non-deterministically. Until we merged that, I knew about the problem and I agreed it's very rare, but once we included non-determinism, I was thinking we should really fix the stale context problem, because in some cases we create the situation intentionally.

we don't have this problem unless someone changes the set of jobs they're changing in a PR, which should be fairly rare.

When someone changes jobs they're changing, we don't have a problem - in that case they create new commits and the "stale" statuses stay tied to the old commits and disappear from the PR (that's the reason why we originally decided not to deal with old statuses). We only have a problem when 1) pj-rehearse runs multiple times for the same commit (re-tests) and selects different jobs to rehearse between executions. This only happens in these cases:

when we change ci-operator itself. This is a one-time event, rare, and I agree we do not need to bother
when some unchanged job, but affected by the PR (like, using a changed ci-op config) is deleted between the rehearse runs. Super-rare
We are nondeterministic ourselves.

I think we could ignore 1+2, but 3 made me think we should address the issue.

stevekuznetsov · 2019-04-15T15:04:48Z

I think there's a simple fix to 3, right? Just use a deterministic index into the list of possible jobs.

petr-muller · 2019-04-15T16:51:18Z

I think there's a simple fix to 3, right? Just use a deterministic index into the list of possible jobs.

Won't be 100% reliable because the set of repos & jobs change, which may offset the originally selected job.

stevekuznetsov · 2019-04-16T00:46:46Z

Sure, but we can be 90% and it's fine. If it's a real issue we could hash all possible jobs and the input changes and choose based on edit distance or something...

stevekuznetsov · 2019-04-16T00:47:43Z

The scope of mitigations for 3 are pretty small in comparison to this change. We could just choose an index one third of the way through the list, for all I care -- the semi-deterministic approach will be fine in most cases and can be really simple.

petr-muller · 2019-04-16T09:30:55Z

/shrug
I'll track the occurrences in the rehearsal metrics so we know how often this actually happens.

stevekuznetsov · 2019-04-16T16:11:06Z

/close

openshift-ci-robot · 2019-04-16T16:11:09Z

@stevekuznetsov: Closed this PR.

Details

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Stale rehearsals happen when the rehearsal tool selects different jobs to rehearse for a single commit (so a /retest needs to be involved). These rehearsals are confusing when they fail, because they are very hard to get rid of (they cannot be separately rerun, and pj-rehearse does not attempt to re-run them either). This change makes the metrics tool track occurrences of this. Related to: - DPTP-368 - openshift#131

openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 8, 2019

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 8, 2019

openshift-ci-robot requested review from bbguimaraes and petr-muller April 8, 2019 14:46

search and retire stalled job statuses

6017b04

openshift-ci-robot added the ¯\_(ツ)_/¯ ¯\\\_(ツ)_/¯ label Apr 16, 2019

openshift-ci-robot closed this Apr 16, 2019

petr-muller mentioned this pull request Apr 18, 2019

Rehearsal metrics: Find and count stale rehearsal occurences #148

Merged

[WIP] search and retire stalled job statuses #131

[WIP] search and retire stalled job statuses #131

Uh oh!

Conversation

droslean commented Apr 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Apr 8, 2019

Uh oh!

droslean commented Apr 8, 2019

Uh oh!

stevekuznetsov commented Apr 9, 2019

Uh oh!

droslean commented Apr 10, 2019

Uh oh!

petr-muller commented Apr 10, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stevekuznetsov commented Apr 11, 2019

Uh oh!

petr-muller commented Apr 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stevekuznetsov commented Apr 15, 2019

Uh oh!

petr-muller commented Apr 15, 2019

Uh oh!

stevekuznetsov commented Apr 16, 2019

Uh oh!

stevekuznetsov commented Apr 16, 2019

Uh oh!

petr-muller commented Apr 16, 2019

Uh oh!

stevekuznetsov commented Apr 16, 2019

Uh oh!

openshift-ci-robot commented Apr 16, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

droslean commented Apr 8, 2019 •

edited

Loading

petr-muller commented Apr 10, 2019 •

edited

Loading

petr-muller commented Apr 12, 2019 •

edited

Loading