Skip to content

[SPARK-15783][CORE] Fix Flakiness in BlacklistIntegrationSuite#13565

Closed
squito wants to merge 10 commits into
apache:masterfrom
squito:blacklist_extra_tests
Closed

[SPARK-15783][CORE] Fix Flakiness in BlacklistIntegrationSuite#13565
squito wants to merge 10 commits into
apache:masterfrom
squito:blacklist_extra_tests

Conversation

@squito

@squito squito commented Jun 8, 2016

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Three changes here -- first two were causing failures w/ BlacklistIntegrationSuite

  1. The testing framework didn't include the reviveOffers thread, so the test which involved delay scheduling might never submit offers late enough for the delay scheduling to kick in. So added in the periodic revive offers, just like the real scheduler.
  2. assertEmptyDataStructures would occasionally fail, because it appeared there was still an active job. This is because in DAGScheduler, the jobWaiter is notified of the job completion before the data structures are cleaned up. Most of the time the test code that is waiting on the jobWaiter won't become active until after the data structures are cleared, but occasionally the race goes the other way, and the assertions fail.
  3. DAGSchedulerSuite was not stopping all the inner parts it was setting up, so each test was leaking a number of threads. So we stop those parts too.
  4. Turns out that assertMapOutputAvailable is not terribly useful in this framework -- most of the places I was trying to use it suffer from some race.
  5. When there is an exception in the backend, try to improve the error msg a little bit. Before the exception was printed to the console, but the test would fail w/ a timeout, and the logs wouldn't show anything.

How was this patch tested?

I ran all the tests in BlacklistIntegrationSuite 5k times and everything in DAGSchedulerSuite 1k times on my laptop. Also I ran a full jenkins build with BlacklistIntegrationSuite 500 times and DAGSchedulerSuite 50 times, see #13548. (I tried more times but jenkins timed out.)

To check for more leaked threads, I added some code to dump the list of all threads at the end of each test in DAGSchedulerSuite, which is how I discovered the mapOutputTracker and eventLoop were leaking threads. (I removed that code from the final pr, just part of the testing.)

And I'll run Jenkins on this a couple of times to do one more check.

class BlacklistIntegrationSuite extends SchedulerIntegrationSuite[MultiExecutorMockBackend]{

val badHost = "host-0"
val duration = Duration(10, SECONDS)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure that such a long duration isn't really necessary, but I don't think it hurts to make it longer just in case.

@squito

squito commented Jun 8, 2016

Copy link
Copy Markdown
Contributor Author

@skonto since you seemed to be able to trigger the problems very reliably, do you mind giving this a spin and seeing if it works for you? :)

@SparkQA

SparkQA commented Jun 8, 2016

Copy link
Copy Markdown

Test build #60188 has finished for PR 13565 at commit 174d070.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito

squito commented Jun 8, 2016

Copy link
Copy Markdown
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Jun 8, 2016

Copy link
Copy Markdown

Test build #60199 has finished for PR 13565 at commit 174d070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin

vanzin commented Jun 8, 2016

Copy link
Copy Markdown
Contributor

LGTM but let's see what Stavros says.

@SparkQA

SparkQA commented Jun 9, 2016

Copy link
Copy Markdown

Test build #3071 has finished for PR 13565 at commit 174d070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 9, 2016

Copy link
Copy Markdown

Test build #3072 has finished for PR 13565 at commit 174d070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2016

Copy link
Copy Markdown

Test build #60548 has finished for PR 13565 at commit e35a7d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito

squito commented Jun 22, 2016

Copy link
Copy Markdown
Contributor Author

tests seem relatively stable now, and this passes regularly for me, so I'm going to merge it and keep an eye on builds.

@asfgit asfgit closed this in cf1995a Jun 22, 2016
@squito

squito commented Jun 22, 2016

Copy link
Copy Markdown
Contributor Author

merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants