tide: serial merges should occur when batches fail #13551
Comments
After kicking out the PR causing the batch failure with an …
Thanks for the issue, Ben. I think this can be addressed by prioritizing triggering serial tests, or by making Tide trigger both batch tests and serial tests in the same sync loop when appropriate.
We should probably trigger both.
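A minimal sketch of what "trigger both in the same sync" could look like, for illustration only (Go; the `subpool`, `pickBatch`, and `pickHighestPriority` names are assumptions for this example, not Tide's actual code):

```go
// Illustrative sketch only: these types and helpers are invented for the
// example and do not reflect Prow's real Tide implementation.
package main

import "fmt"

type PR struct {
	Number  int
	Passing bool
}

type subpool struct {
	prs          []PR
	batchRunning bool
}

// syncSubpool shows the idea discussed above: when a batch is (re)triggered,
// also trigger serial tests for the highest-priority PR, so a failing batch
// does not block all progress until the next sync.
func syncSubpool(sp subpool) {
	if batch := pickBatch(sp); len(batch) > 1 && !sp.batchRunning {
		fmt.Printf("triggering batch tests for PRs %v\n", prNumbers(batch))
	}
	// Today this branch is only taken when no batch is started; the proposal
	// is to take it in the same sync as well.
	if serial := pickHighestPriority(sp); serial != nil {
		fmt.Printf("triggering serial retest for PR #%d\n", serial.Number)
	}
}

// pickBatch collects the PRs that currently look mergeable into a batch.
func pickBatch(sp subpool) []PR {
	var batch []PR
	for _, pr := range sp.prs {
		if pr.Passing {
			batch = append(batch, pr)
		}
	}
	return batch
}

// pickHighestPriority returns the single PR that would be retested serially.
func pickHighestPriority(sp subpool) *PR {
	if len(sp.prs) == 0 {
		return nil
	}
	return &sp.prs[0]
}

func prNumbers(prs []PR) []int {
	nums := make([]int, 0, len(prs))
	for _, pr := range prs {
		nums = append(nums, pr.Number)
	}
	return nums
}

func main() {
	syncSubpool(subpool{prs: []PR{{1111, true}, {1134, true}, {1151, false}}})
}
```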
/sig testing
IMHO we should just stop silently swallowing batch test errors and re-testing forever, and instead report them to the PRs, so Tide doesn't consider the PRs again until someone issues a /retest.
Isn't that against the premise of …
The /retest on LGTM+approve+flake is automated though? So if it did report the failure we'd auto retest it, serially.
But still introduce some jitter as the …
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This came up again today; we're seeing an ever-expanding batch in k/k.
Completely regardless of that, I think we must introduce some jitter. If a batch fails, kick out every PR in it. People can still retest if they think the failure was not caused by their PR, and the retesting can also be triggered automatically via commenter.
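A hedged sketch of the "kick the failed batch's PRs out until they're retested" idea (illustrative Go; the `poolState` type and the retest hook are hypothetical, not Tide's real data structures):

```go
// Hypothetical bookkeeping for excluding PRs from batching after a batch
// failure; invented for illustration, not part of Tide.
package main

import "fmt"

type poolState struct {
	// excluded maps PR number -> true while the PR is kicked out of batching
	// because it was part of a failed batch.
	excluded map[int]bool
}

// onBatchFailure excludes every PR from the failed batch so the next sync
// falls back to serial retests instead of rebuilding the same batch.
func (p *poolState) onBatchFailure(batch []int) {
	for _, pr := range batch {
		p.excluded[pr] = true
	}
}

// onRetestComment re-admits a PR when someone (or the automated retest
// commenter) issues /retest, matching the comment above.
func (p *poolState) onRetestComment(pr int) {
	delete(p.excluded, pr)
}

// eligibleForBatch filters out excluded PRs when the next batch is assembled.
func (p *poolState) eligibleForBatch(prs []int) []int {
	var out []int
	for _, pr := range prs {
		if !p.excluded[pr] {
			out = append(out, pr)
		}
	}
	return out
}

func main() {
	p := &poolState{excluded: map[int]bool{}}
	p.onBatchFailure([]int{1111, 1134})
	p.onRetestComment(1134)
	fmt.Println(p.eligibleForBatch([]int{1111, 1134, 1151})) // [1134 1151]
}
```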
Origin has some high-flake tests, but I'm not sure which ones. In combination with [1] that means that origin progress can wedge on repeatedly failing batches that nobody can /override. Disable batching [2], which will have two benefits:

* No failing batch jobs to block individual PRs from merging, so we're able to make incremental progress.
* Lots of PRs will run tests, so even high-flake tests are likely to pass on at least one approved PR. And maintainers can /override if they feel the need.

And some drawbacks:

* Lots of tests are going to get run, as the retest bot launches tests on PR A that will go obsolete when PR B lands first.
* Merging PRs one at a time is slow (with our slow CI jobs), and Tide may not be able to keep up with the rate at which PRs are approved.

Ideally this gets reverted once the origin maintainers identify the flaking jobs and fix them or set them 'optional: true'.

[1]: kubernetes/test-infra#13551
[2]: https://github.com/kubernetes/test-infra/blob/bb4bfe2c0e16c8e93ff7fc0ba4d8c37cdee2a7f5/prow/config/tide.go#L142-L148
The original intent here is that:
IMO any other behavior than this is a bug -- such as there being multiple approved PRs and no batch and/or not scheduling a serial run.
It seems openshift disabled batching to work around openshift/release#9786 (x-ref-ed this issue above).
@BenTheElder very briefly, it was subsequently re-activated in openshift/release#9831.
Apparently present again in kubernetes/node-problem-detector#495: use of PULL_NUMBER breaks batch runs, so nothing has merged and there are 7 PRs stuck in endless batch testing.
The problem there is that the batch tests fail very quickly. We always start either a batch (if available and none running yet) or a serial retest. Since the batch always fails before Tide's next sync, we never get to the point of starting a serial retest there. You can see that nicely on https://prow.k8s.io/tide-history?repo=kubernetes%2Fnode-problem-detector, where a batch test is started every two minutes, which matches prow.k8s.io's Tide sync interval.
This doesn't seem like a super unlikely failure mode though; we've seen stuff like this before. I think Tide should run a serial test in the background to prevent infinitely spamming broken batches with no progress. Even if it wasn't done concurrently ordinarily, since we do record history, we could detect repeated batches and opt to start a serial job instead.
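Since history is recorded, the fallback proposed here could be approximated by counting repeated failures of an identical batch. A minimal sketch, assuming hypothetical bookkeeping rather than Tide's actual history store (illustrative Go):

```go
// Sketch of "detect repeated batches and start a serial job instead".
// The batchKey/failure-count bookkeeping is an assumption for illustration.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// batchKey builds a stable identifier for a batch from its PR numbers.
func batchKey(prs []int) string {
	sorted := append([]int(nil), prs...)
	sort.Ints(sorted)
	parts := make([]string, len(sorted))
	for i, pr := range sorted {
		parts[i] = fmt.Sprint(pr)
	}
	return strings.Join(parts, ",")
}

// shouldFallBackToSerial returns true once the same batch has failed more than
// maxRepeats times, so the caller can trigger a serial retest instead of
// re-running an identical, quickly-failing batch forever.
func shouldFallBackToSerial(failureCounts map[string]int, prs []int, maxRepeats int) bool {
	key := batchKey(prs)
	failureCounts[key]++
	return failureCounts[key] > maxRepeats
}

func main() {
	counts := map[string]int{}
	batch := []int{1111, 1134, 1151}
	for i := 0; i < 4; i++ {
		if shouldFallBackToSerial(counts, batch, 3) {
			fmt.Println("same batch failed repeatedly; triggering a serial retest instead")
		} else {
			fmt.Println("re-triggering batch", batchKey(batch))
		}
	}
}
```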
Latest variation: a batch job has been mysteriously perma-"pending" with no results, so we've ceased merging in kubernetes #22432.
Not doing serial merges while we wait for a batch job to finish is intended, as a serial merge would invalidate the batch. The perma-pending job is of course not intended, though.
If I see issues like this in the future I'll file an issue with kubernetes-sigs/prow and link back to past context.
What happened: Tide spent many hours retesting a failing batch of github.com/kubernetes/enhancements PRs 1111, 1134, 1151, 1153, 1155. One PR caused a test to fail, and no PRs merged during that time.
https://prow.k8s.io/tide-history?repo=kubernetes%2Fenhancements
https://prow.k8s.io/?repo=kubernetes%2Fenhancements&type=batch
What you expected to happen: Tide should merge one of the passing PRs instead of retesting the same exact batch > 270 times without merging anything or trying a different batch. (And many more times before that when fewer PRs were ready to merge).
How to reproduce it (as minimally and precisely as possible): ¯\_(ツ)_/¯
Please provide links to example occurrences, if any: In description above.
Anything else we need to know?:
/area prow
/area prow/tide