tide: serial merges should occur when batches fail #13551


Closed
BenTheElder opened this issue Jul 23, 2019 · 23 comments
Labels
area/prow/tide, area/prow, kind/bug, lifecycle/frozen, sig/testing

Comments

@BenTheElder
Member

What happened: Tide spent many hours retesting a failing batch of github.com/kubernetes/enhancements PRs 1111, 1134, 1151, 1153, 1155. One PR caused a test to fail, and no PRs merged during that time.

https://prow.k8s.io/tide-history?repo=kubernetes%2Fenhancements
https://prow.k8s.io/?repo=kubernetes%2Fenhancements&type=batch

What you expected to happen: Tide should merge one of the passing PRs instead of retesting the exact same batch more than 270 times without merging anything or trying a different batch. (And many more times before that, when fewer PRs were ready to merge.)

How to reproduce it (as minimally and precisely as possible): ¯\_(ツ)_/¯

Please provide links to example occurrences, if any: In description above.

Anything else we need to know?:
/area prow
/area prow/tide

@BenTheElder BenTheElder added the kind/bug Categorizes issue or PR as related to a bug. label Jul 23, 2019
@k8s-ci-robot k8s-ci-robot added area/prow Issues or PRs related to prow area/prow/tide Issues or PRs related to prow's tide component labels Jul 23, 2019
@BenTheElder
Member Author

After kicking out the PR causing the batch failure with an /lgtm cancel, the other PRs merged, so this particular instance is mitigated for the moment.

@cjwagner
Member

Thanks for the issue, Ben.
Tide triggers a serial test whenever it sees a batch running and no up-to-date pending or passing serial test. Based on the timestamps, it looks like Tide was continually triggering a new batch because the existing one was failing before the next Tide sync, and we prioritize triggering batches over triggering serial tests, so we never got the chance to trigger a serial test.

I think this can be addressed by prioritizing triggering serial tests or by making Tide trigger both batch tests and serial tests in the same sync loop when appropriate.
/assign @stevekuznetsov
WDYT?
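
For illustration only, here is a minimal Go sketch of the prioritization just described; the names and structure are hypothetical, not the actual Tide source. The point is that when a batch fails before the next sync, the batch branch wins again on every pass and the serial branch is never reached.

```go
package main

import "fmt"

type pr struct{ number int }

// pickTestToTrigger mirrors the behavior described above: batches are
// preferred, so if the previous batch already failed before this sync,
// another identical batch is triggered and the serial retest never starts.
func pickTestToTrigger(batchRunning, serialUpToDate bool, candidates []pr) string {
	if !batchRunning && len(candidates) > 1 {
		return "batch" // a new batch is (re)triggered on every sync
	}
	if !serialUpToDate && len(candidates) > 0 {
		return "serial" // only reachable while a batch is still running
	}
	return "none"
}

func main() {
	// The failing batch already finished before this sync, so batchRunning is
	// false again and the same batch is retriggered instead of a serial test.
	fmt.Println(pickTestToTrigger(false, false, []pr{{1111}, {1134}, {1151}}))
}
```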

@stevekuznetsov
Contributor

We should probably trigger both.

@spiffxp
Member

spiffxp commented Jul 26, 2019

/sig testing

@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jul 26, 2019
@alvaroaleman
Member

IMHO we should just stop silently swallowing batch test errors and re-testing forever, and instead report them to the PRs, so Tide doesn't consider the PRs again until someone issues a /retest to make their contexts green again: #12216 (comment)

@stevekuznetsov
Contributor

Isn't that against the premise of Tide, though? A flake in a batch shouldn't require human intervention, IMO. Even the /retest on LGTM + approve + flake is onerous and has been ~automated.

@BenTheElder
Member Author

The /retest on LGTM+approve+flake is automated though? So if it did report the failure we'd auto retest it, serially.

@alvaroaleman
Member

The /retest on LGTM+approve+flake is automated though? So if it did report the failure we'd auto retest it, serially.

But that would still introduce some jitter, as the /retest command gets posted after some delay.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 10, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 9, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@BenTheElder
Member Author

This came up again today; we're seeing an ever-expanding batch in k/k.
/lifecycle frozen

@BenTheElder BenTheElder reopened this Jun 17, 2020
@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 17, 2020
@alvaroaleman
Member

alvaroaleman commented Jun 18, 2020

One failure mode where this happens is if a batch of a given set of PRs failed but in the meantime a new PR became eligible for retesting and merging. We will then start the batch with the new set rather than merging the one PR. This is something I guess we could fix.
This statement was wrong. We prioritize merging a single PR over creating a batch test. But when we trigger a re-test, we trigger either a batch or a single PR, not both. It would probably be good to do both, to at least make some progress in case of failing batches.

Completely regardless of that, I think we must introduce some jitter. If a batch fails, kick one of its PRs out. People can still /retest if they think it was not because of their PR, and the retesting can also be triggered automatically via the commenter.
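
A rough sketch of the "kick one PR out of a failed batch" idea above, using made-up types and an arbitrary eviction choice; nothing here is existing Prow code.

```go
package main

import "fmt"

type pr struct{ number int }

// dropOneOnFailure removes a single PR from a failed batch so that the next
// batch has a different composition; the dropped PR would then need a /retest
// (possibly posted automatically by a commenter) before Tide considers it again.
func dropOneOnFailure(batch []pr, failed bool) (remaining []pr, dropped *pr) {
	if !failed || len(batch) == 0 {
		return batch, nil
	}
	last := batch[len(batch)-1] // arbitrary pick; any heuristic could choose the victim
	return batch[:len(batch)-1], &last
}

func main() {
	remaining, dropped := dropOneOnFailure([]pr{{1111}, {1134}, {1151}}, true)
	fmt.Printf("retry batch %v after kicking out #%d\n", remaining, dropped.number)
}
```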

wking added a commit to wking/openshift-release that referenced this issue Jun 18, 2020
Origin has some high-flake tests, but I'm not sure which ones.  In
combination with [1] that means that origin progress can wedge on
repeatedly failing batches that nobody can /override.  Disable
batching [2], which will have two benefits:

* No failing batch jobs to block individual PRs from merging, so we're
  able to make incremental progress.
* Lots of PRs will run tests, so even high-flake tests are likely to
  pass on at least one approved PR.  And maintainers can /override if
  they feel the need.

And some drawbacks:

* Lots of tests are going to get run, as the retest bot launches test
  on PR A that will go obsolete when PR B lands first.
* Merging PRs one at a time is slow (with our slow CI jobs), and Tide
  may not be able to keep up with the rate at which PRs are approved.

Ideally this gets reverted once the origin maintainers identify the
flaking jobs and fix them or set them 'optional: true'.

[1]: kubernetes/test-infra#13551
[2]: https://github.com/kubernetes/test-infra/blob/bb4bfe2c0e16c8e93ff7fc0ba4d8c37cdee2a7f5/prow/config/tide.go#L142-L148
@fejta
Contributor

fejta commented Jun 20, 2020

The original intent here is that:

  • we always have both a batch and single retest running at the same time
  • we wait until the batch completes and merge it if it passes
  • otherwise we look at the single and merge that if it passes
  • repeat ad nauseam

IMO any other behavior than this is a bug -- such as there being multiple approved PRs and no batch and/or not scheduling a serial run.
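
To make that intended flow concrete, here is a pseudocode-style Go sketch of one sync iteration; the job runners and merge steps are hypothetical stand-ins, not the real Tide implementation.

```go
package main

import "fmt"

type result struct{ passed bool }

// Hypothetical stand-ins for triggering a job and waiting for its verdict.
func runBatch() result  { return result{passed: false} } // e.g. the batch flaked
func runSerial() result { return result{passed: true} }

func syncOnce() {
	// Keep a batch and a single-PR retest running at the same time.
	batch := make(chan result, 1)
	serial := make(chan result, 1)
	go func() { batch <- runBatch() }()
	go func() { serial <- runSerial() }()

	// Wait for the batch and merge all of it if it passed...
	if b := <-batch; b.passed {
		fmt.Println("merging the whole batch")
		return
	}
	// ...otherwise fall back to the serial result so at least one PR lands.
	if s := <-serial; s.passed {
		fmt.Println("merging the single PR")
	}
}

func main() { syncOnce() } // repeated ad nauseam by the real sync loop
```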

@BenTheElder
Member Author

It seems openshift disabled batching to work around openshift/release#9786 (x-ref-ed this issue above)

@alvaroaleman
Member

@BenTheElder very briefly, it was subsequently re-activated in openshift/release#9831

@BenTheElder
Member Author

Apparently present again in kubernetes/node-problem-detector#495: use of PULL_NUMBER breaks batch runs; nothing has merged, but there are 7 PRs stuck in endless batch testing.

@alvaroaleman
Member

Apparently present again in kubernetes/node-problem-detector#495: use of PULL_NUMBER breaks batch runs; nothing has merged, but there are 7 PRs stuck in endless batch testing.

The problem there is that the batch tests fail very quickly. We always start either a batch (if available and none running yet) or a serial retest. Since the batch always fails before Tide's next sync, we never get to the point of starting a serial retest there. You can see that nicely at https://prow.k8s.io/tide-history?repo=kubernetes%2Fnode-problem-detector, where a batch test is started every two minutes, which matches prow.k8s.io's Tide sync period.

@BenTheElder
Member Author

BenTheElder commented Nov 16, 2020

This doesn't seem like a super unlikely failure mode though; we've seen stuff like this before. I think Tide should run a serial test in the background to prevent infinitely spamming broken batches with no progress.

Even if it weren't ordinarily done concurrently, since we do record history, we could detect repeated batches and opt to start a serial job instead.
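
A hedged sketch of that "detect repeated batches from the recorded history" idea, with made-up types and a made-up threshold; the real tide-history records look different.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// batchKey builds a stable identity for a batch from its PR numbers.
func batchKey(prs []int) string {
	sorted := append([]int(nil), prs...)
	sort.Ints(sorted)
	parts := make([]string, len(sorted))
	for i, n := range sorted {
		parts[i] = fmt.Sprint(n)
	}
	return strings.Join(parts, ",")
}

// shouldFallBackToSerial reports whether the same batch has already failed a
// given number of times recently, in which case the next sync should trigger
// a serial retest instead of yet another identical batch.
func shouldFallBackToSerial(recentFailedBatches [][]int, next []int, limit int) bool {
	repeats := 0
	for _, b := range recentFailedBatches {
		if batchKey(b) == batchKey(next) {
			repeats++
		}
	}
	return repeats >= limit
}

func main() {
	history := [][]int{{1111, 1134, 1151}, {1151, 1111, 1134}, {1111, 1134, 1151}}
	fmt.Println(shouldFallBackToSerial(history, []int{1134, 1111, 1151}, 3)) // true
}
```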

@BenTheElder
Member Author

Latest variation: a batch job has been mysteriously perma-"pending" with no results, so we've ceased merging in kubernetes #22432

@alvaroaleman
Member

Latest variation: a batch job has been mysteriously perma-"pending" with no results, so we've ceased merging in kubernetes #22432

Not doing serial merges while we wait for a batch job to finish is intended, as merging would invalidate the batch. The perma-pending is of course not intended, though.

@BenTheElder
Member Author

If I see issues like this in the future, I'll file an issue with kubernetes-sigs/prow and link back to past context.
Currently I'm not aware of this being a big problem. Closing out old inactive Prow issues in this repo now that we've moved Prow to its own repo.

@BenTheElder BenTheElder closed this as not planned (won't fix, can't repro, duplicate, stale) May 30, 2024