[SPARK-15783][CORE] Fix Flakiness in BlacklistIntegrationSuite #13565
squito wants to merge 10 commits into
…ests so ignore for now" This reverts commit 36d3dfa.
class BlacklistIntegrationSuite extends SchedulerIntegrationSuite[MultiExecutorMockBackend]{

  val badHost = "host-0"
  val duration = Duration(10, SECONDS)
Pretty sure that such a long duration isn't really necessary, but I don't think it hurts to make it longer just in case.
@skonto since you seemed to be able to trigger the problems very reliably, do you mind giving this a spin and seeing if it works for you? :)
Test build #60188 has finished for PR 13565 at commit
Jenkins, retest this please
Test build #60199 has finished for PR 13565 at commit
LGTM but let's see what Stavros says.
Test build #3071 has finished for PR 13565 at commit
Test build #3072 has finished for PR 13565 at commit
Test build #60548 has finished for PR 13565 at commit
Tests seem relatively stable now, and this passes regularly for me, so I'm going to merge it and keep an eye on builds.

Merged to master.
What changes were proposed in this pull request?
Three changes here -- the first two were causing failures w/ BlacklistIntegrationSuite:

1. assertEmptyDataStructures would occasionally fail, because it appeared there was still an active job. This is because in DAGScheduler, the jobWaiter is notified of the job completion before the data structures are cleaned up. Most of the time the test code that is waiting on the jobWaiter won't become active until after the data structures are cleared, but occasionally the race goes the other way, and the assertions fail.

2. DAGSchedulerSuite was not stopping all the inner parts it was setting up, so each test was leaking a number of threads. So we stop those parts too.

3. assertMapOutputAvailable is not terribly useful in this framework -- most of the places I was trying to use it suffer from some race.

How was this patch tested?
I ran all the tests in BlacklistIntegrationSuite 5k times and everything in DAGSchedulerSuite 1k times on my laptop. Also I ran a full Jenkins build with BlacklistIntegrationSuite 500 times and DAGSchedulerSuite 50 times, see #13548. (I tried more times but Jenkins timed out.)

To check for more leaked threads, I added some code to dump the list of all threads at the end of each test in DAGSchedulerSuite, which is how I discovered the mapOutputTracker and eventLoop were leaking threads. (I removed that code from the final PR, it was just part of the testing.)
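
For anyone who wants to repeat the leaked-thread check, here is a minimal sketch of that kind of thread dump. It is not the code that was used for this PR (that was removed before merging); the names ThreadDump, liveThreadNames and leaked are hypothetical, and it only relies on the JDK's Thread.getAllStackTraces.

```scala
object ThreadDump {
  /** Names of all live threads in this JVM, sorted so two dumps are easy to compare. */
  def liveThreadNames(): Seq[String] =
    Thread.getAllStackTraces.keySet
      .toArray(new Array[Thread](0))
      .map(_.getName)
      .sorted
      .toSeq

  /** Thread names present in `after` but not accounted for in `before`. */
  def leaked(before: Seq[String], after: Seq[String]): Seq[String] =
    after.diff(before)
}
```

Taking one dump before a test and one after it, then printing ThreadDump.leaked(before, after), should show any thread the test started and never stopped. Threads that are still in the middle of shutting down may show up as well, so a short wait before the second dump can cut down on noise.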
And I'll run Jenkins on this a couple of times to do one more check.
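
To illustrate the assertEmptyDataStructures race described above, here is a small standalone sketch. It does not use any Spark classes and is not necessarily how this PR resolves the problem: activeJobs, jobDone and waitUntil are hypothetical stand-ins for the scheduler's bookkeeping, the jobWaiter notification, and a polling assertion.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicInteger

object CleanupRaceSketch {
  /** Polls `condition` until it holds or `timeoutMs` elapses; returns the final result. */
  def waitUntil(timeoutMs: Long, intervalMs: Long = 10)(condition: => Boolean): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    while (!condition && System.nanoTime() < deadline) Thread.sleep(intervalMs)
    condition
  }

  /** Runs one simulated job; returns true if the bookkeeping ends up empty. */
  def runOnce(): Boolean = {
    val activeJobs = new AtomicInteger(1)   // stand-in for the scheduler's job tracking
    val jobDone = new CountDownLatch(1)     // stand-in for the job-completion notification

    val scheduler = new Thread(new Runnable {
      def run(): Unit = {
        jobDone.countDown()                 // the waiter is notified first ...
        Thread.sleep(50)                    // (window exaggerated for the sketch)
        activeJobs.decrementAndGet()        // ... and the cleanup happens afterwards
      }
    })
    scheduler.start()

    jobDone.await()
    // Checking activeJobs.get == 0 right here would be racy: the cleanup may not have run yet.
    val clean = waitUntil(5000)(activeJobs.get == 0)
    scheduler.join()
    clean
  }
}
```

The point of the sketch is only the ordering: once the notification fires before the cleanup, any check made immediately after the wait returns can observe stale state, so the check has to either poll with a timeout or wait on something that is signalled after the cleanup.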