tide: serial merges should occur when batches fail #13551


Closed
BenTheElder opened this issue Jul 23, 2019 · 23 comments
Labels
area/prow/tide, area/prow, kind/bug, lifecycle/frozen, sig/testing

Comments

@BenTheElder
Member

What happened: Tide spent many hours retesting a failing batch of github.com/kubernetes/enhancements PRs 1111, 1134, 1151, 1153, 1155. One PR caused a test to fail, and no PRs merged during that time.

https://prow.k8s.io/tide-history?repo=kubernetes%2Fenhancements
https://prow.k8s.io/?repo=kubernetes%2Fenhancements&type=batch

What you expected to happen: Tide should merge one of the passing PRs instead of retesting the exact same batch more than 270 times without merging anything or trying a different batch. (And many more times before that, when fewer PRs were ready to merge.)

How to reproduce it (as minimally and precisely as possible): ¯\_(ツ)_/¯

Please provide links to example occurrences, if any: In description above.

Anything else we need to know?:
/area prow
/area prow/tide

@BenTheElder BenTheElder added the kind/bug Categorizes issue or PR as related to a bug. label Jul 23, 2019
@k8s-ci-robot k8s-ci-robot added area/prow Issues or PRs related to prow area/prow/tide Issues or PRs related to prow's tide component labels Jul 23, 2019
@BenTheElder
Member Author

After kicking out the PR causing the batch failure with an /lgtm cancel, the other PRs merged, so this particular instance is mitigated for the moment.

@cjwagner
Member

Thanks for the issue, Ben.
Tide triggers a serial test whenever it sees a batch running and no up-to-date pending or passing serial test. Based on the timestamps, it looks like Tide was continually triggering a new batch because the existing one was failing before the next Tide sync, and we prioritize triggering batches over triggering serial tests, so we never got the chance to trigger a serial test.

I think this can be addressed by prioritizing triggering serial tests or by making Tide trigger both batch tests and serial tests in the same sync loop when appropriate.
/assign @stevekuznetsov
WDYT?
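
For illustration only, here is a minimal Go sketch of the prioritization just described; the names and structure are hypothetical, not the actual Tide source. The point is that when a batch fails before the next sync, the batch branch wins again on every pass and the serial branch is never reached.

```go
package main

import "fmt"

type pr struct{ number int }

// pickTestToTrigger mirrors the behavior described above: batches are
// preferred, so if the previous batch already failed before this sync,
// another identical batch is triggered and the serial retest never starts.
func pickTestToTrigger(batchRunning, serialUpToDate bool, candidates []pr) string {
	if !batchRunning && len(candidates) > 1 {
		return "batch" // a new batch is (re)triggered on every sync
	}
	if !serialUpToDate && len(candidates) > 0 {
		return "serial" // only reachable while a batch is still running
	}
	return "none"
}

func main() {
	// The failing batch already finished before this sync, so batchRunning is
	// false again and the same batch is retriggered instead of a serial test.
	fmt.Println(pickTestToTrigger(false, false, []pr{{1111}, {1134}, {1151}}))
}
```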

@stevekuznetsov
Contributor

We should probably trigger both.

@spiffxp
Member

spiffxp commented Jul 26, 2019

/sig testing

@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jul 26, 2019
@alvaroaleman
Member

IMHO we should just stop silently swallowing batch test errors and re-testing forever, and instead report them to the PRs, so Tide doesn't consider the PRs again until someone issues a /retest to make their contexts green again: #12216 (comment)

@stevekuznetsov
Contributor

Isn't that against the premise of Tide, though? A flake in a batch shouldn't require human intervention, IMO. Even the /retest on LGTM + approve + flake is onerous and has been ~automated.

@BenTheElder
Member Author

The /retest on LGTM+approve+flake is automated though? So if it did report the failure we'd auto retest it, serially.

@alvaroaleman
Member

The /retest on LGTM+approve+flake is automated though? So if it did report the failure we'd auto retest it, serially.

But that would still introduce some jitter, as the /retest command gets posted after some delay.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 10, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 9, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@BenTheElder
Member Author

This came up again today; we're seeing an ever-expanding batch in k/k.
/lifecycle frozen

@BenTheElder BenTheElder reopened this Jun 17, 2020
@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 17, 2020
@alvaroaleman
Member

alvaroaleman commented Jun 18, 2020

One failure mode where this happens is if a batch of a given set of PRs failed but in the meantime a new PR became eligible for retesting and merging. We will then start the batch with the new set rather than merging the one PR. This is something I guess we could fix.
This statement was wrong. We prioritize merging a single PR over creating a batch test. But when we trigger a re-test, we trigger either a batch or a single PR, not both. It would probably be good to do both, to at least make some progress in case of failing batches.

Completely regardless of that, I think we must introduce some jitter. If a batch fails, kick one of its PRs out. People can still /retest if they think it was not because of their PR, and the retesting can also be triggered automatically via the commenter.
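
A rough sketch of the "kick one PR out of a failed batch" idea above, using made-up types and an arbitrary eviction choice; nothing here is existing Prow code.

```go
package main

import "fmt"

type pr struct{ number int }

// dropOneOnFailure removes a single PR from a failed batch so that the next
// batch has a different composition; the dropped PR would then need a /retest
// (possibly posted automatically by a commenter) before Tide considers it again.
func dropOneOnFailure(batch []pr, failed bool) (remaining []pr, dropped *pr) {
	if !failed || len(batch) == 0 {
		return batch, nil
	}
	last := batch[len(batch)-1] // arbitrary pick; any heuristic could choose the victim
	return batch[:len(batch)-1], &last
}

func main() {
	remaining, dropped := dropOneOnFailure([]pr{{1111}, {1134}, {1151}}, true)
	fmt.Printf("retry batch %v after kicking out #%d\n", remaining, dropped.number)
}
```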

wking added a commit to wking/openshift-release that referenced this issue Jun 18, 2020
Origin has some high-flake tests, but I'm not sure which ones.  In
combination with [1] that means that origin progress can wedge on
repeatedly failing batches that nobody can /override.  Disable
batching [2], which will have two benefits:

* No failing batch jobs to block individual PRs from merging, so we're
  able to make incremental progress.
* Lots of PRs will run tests, so even high-flake tests are likely to
  pass on at least one approved PR.  And maintainers can /override if
  they feel the need.

And some drawbacks:

* Lots of tests are going to get run, as the retest bot launches test
  on PR A that will go obsolete when PR B lands first.
* Merging PRs one at a time is slow (with our slow CI jobs), and Tide
  may not be able to keep up with the rate at which PRs are approved.

Ideally this gets reverted once the origin maintainers identify the
flaking jobs and fix them or set them 'optional: true'.

[1]: kubernetes/test-infra#13551
[2]: https://github.com/kubernetes/test-infra/blob/bb4bfe2c0e16c8e93ff7fc0ba4d8c37cdee2a7f5/prow/config/tide.go#L142-L148
@fejta
Contributor

fejta commented Jun 20, 2020

The original intent here is that:

  • we always have both a batch and single retest running at the same time
  • we wait until the batch completes and merge it if it passes
  • otherwise we look at the single and merge that if it passes
  • repeat ad nauseam

IMO any other behavior than this is a bug -- such as there being multiple approved PRs and no batch and/or not scheduling a serial run.
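
To make that intended flow concrete, here is a pseudocode-style Go sketch of one sync iteration; the job runners and merge steps are hypothetical stand-ins, not the real Tide implementation.

```go
package main

import "fmt"

type result struct{ passed bool }

// Hypothetical stand-ins for triggering a job and waiting for its verdict.
func runBatch() result  { return result{passed: false} } // e.g. the batch flaked
func runSerial() result { return result{passed: true} }

func syncOnce() {
	// Keep a batch and a single-PR retest running at the same time.
	batch := make(chan result, 1)
	serial := make(chan result, 1)
	go func() { batch <- runBatch() }()
	go func() { serial <- runSerial() }()

	// Wait for the batch and merge all of it if it passed...
	if b := <-batch; b.passed {
		fmt.Println("merging the whole batch")
		return
	}
	// ...otherwise fall back to the serial result so at least one PR lands.
	if s := <-serial; s.passed {
		fmt.Println("merging the single PR")
	}
}

func main() { syncOnce() } // repeated ad nauseam by the real sync loop
```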

@BenTheElder
Member Author

It seems openshift disabled batching to work around openshift/release#9786 (x-ref-ed this issue above)

@alvaroaleman
Member

@BenTheElder very briefly, it was subsequently re-activated in openshift/release#9831

@BenTheElder
Member Author

Apparently present again in kubernetes/node-problem-detector#495: use of PULL_NUMBER breaks batch runs; nothing has merged, but there are 7 PRs stuck in endless batch testing.

@alvaroaleman
Member

Apparently present again in kubernetes/node-problem-detector#495: use of PULL_NUMBER breaks batch runs; nothing has merged, but there are 7 PRs stuck in endless batch testing.

The problem there is that the batch tests fail very quickly. We always start either a batch (if available and none running yet) or a serial retest. Since the batch always fails before Tide's next sync, we never get to the point of starting a serial retest there. You can see that nicely at https://prow.k8s.io/tide-history?repo=kubernetes%2Fnode-problem-detector, where a batch test is started every two minutes, which matches prow.k8s.io's Tide sync period.

@BenTheElder
Member Author

BenTheElder commented Nov 16, 2020

This doesn't seem like a super unlikely failure mode though; we've seen stuff like this before. I think Tide should run a serial test in the background to prevent infinitely spamming broken batches with no progress.

Even if it weren't ordinarily done concurrently, since we do record history, we could detect repeated batches and opt to start a serial job instead.
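
A hedged sketch of that "detect repeated batches from the recorded history" idea, with made-up types and a made-up threshold; the real tide-history records look different.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// batchKey builds a stable identity for a batch from its PR numbers.
func batchKey(prs []int) string {
	sorted := append([]int(nil), prs...)
	sort.Ints(sorted)
	parts := make([]string, len(sorted))
	for i, n := range sorted {
		parts[i] = fmt.Sprint(n)
	}
	return strings.Join(parts, ",")
}

// shouldFallBackToSerial reports whether the same batch has already failed a
// given number of times recently, in which case the next sync should trigger
// a serial retest instead of yet another identical batch.
func shouldFallBackToSerial(recentFailedBatches [][]int, next []int, limit int) bool {
	repeats := 0
	for _, b := range recentFailedBatches {
		if batchKey(b) == batchKey(next) {
			repeats++
		}
	}
	return repeats >= limit
}

func main() {
	history := [][]int{{1111, 1134, 1151}, {1151, 1111, 1134}, {1111, 1134, 1151}}
	fmt.Println(shouldFallBackToSerial(history, []int{1134, 1111, 1151}, 3)) // true
}
```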

@BenTheElder
Member Author

Latest variation: a batch job has been mysteriously perma-"pending" with no results, so we've ceased merging in kubernetes #22432

@alvaroaleman
Member

Latest variation: a batch job has been mysteriously perma-"pending" with no results, so we've ceased merging in kubernetes #22432

Not doing serial merges while we wait for a batch job to finish is intended, as merging would invalidate the batch. The perma-pending is of course not intended, though.

@BenTheElder
Member Author

If I see issues like this in the future, I'll file an issue with kubernetes-sigs/prow and link back to past context.
Currently I'm not aware of this being a big problem. Closing out old inactive Prow issues in this repo now that we've moved Prow to its own repo.

@BenTheElder BenTheElder closed this as not planned (won't fix, can't repro, duplicate, stale) May 30, 2024