@squito
Contributor

@squito squito commented Sep 1, 2015

This is a basic framework for testing the entire scheduler. The tests this adds aren't very interesting -- the point of this PR is just to set up the framework and keep the initial change small, but it can be built upon to test more features (e.g., speculation, killing tasks, blacklisting, etc.).

@SparkQA

SparkQA commented Sep 1, 2015

Test build #41877 has finished for PR 8559 at commit 844bd33.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor Author

squito commented Sep 1, 2015

Jenkins, retest this please

@@ -1123,8 +1133,15 @@ class DAGScheduler(
// TODO: Cancel running tasks in the stage
logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
s"$failedStage (${failedStage.name}) due to fetch failure")
// We might get lots of fetch failed for this stage, from lots of executors.
// Its better if we can resubmit for all the failed executors at one time, so lets
// just wait a *bit* before we resubmit.
Contributor Author

I have no idea if this comment is accurate or not -- I just found this pretty confusing and felt it deserved some comment, so I took my best guess. The tests in DAGSchedulerSuite pass if you just post this event directly (with other corresponding changes to go along with it ...)

Contributor Author

See #8560, which shows that you can remove the messageScheduler completely (the tests pass, anyway ...)
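
The batching idea described in the code comment above (wait a bit so that many fetch failures lead to one resubmit) can be sketched roughly as follows. This is only an illustration under stated assumptions: `BatchedResubmitter`, `reportFailure`, and the integer stage ids are hypothetical names, not the DAGScheduler's actual code, and only `java.util.concurrent` is used.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable

// Hypothetical helper, NOT Spark's DAGScheduler: it collects failure reports
// that arrive close together and fires a single resubmit after a short delay.
class BatchedResubmitter(delayMs: Long)(resubmit: Set[Int] => Unit) {
  private val timer = Executors.newSingleThreadScheduledExecutor()
  private val pending = mutable.Set.empty[Int]

  // Record a failed stage; only the first report of a batch schedules a flush.
  def reportFailure(stageId: Int): Unit = synchronized {
    val firstInBatch = pending.isEmpty
    pending += stageId
    if (firstInBatch) {
      timer.schedule(new Runnable {
        override def run(): Unit = flush()
      }, delayMs, TimeUnit.MILLISECONDS)
    }
  }

  // Take a snapshot of everything reported so far and resubmit it in one go.
  private def flush(): Unit = {
    val batch = synchronized {
      val snapshot = pending.toSet
      pending.clear()
      snapshot
    }
    resubmit(batch)
  }

  // Already-scheduled delayed flushes still run after shutdown() by default,
  // so awaitTermination waits for the pending batch to be delivered.
  def shutdown(): Unit = {
    timer.shutdown()
    timer.awaitTermination(5, TimeUnit.SECONDS)
  }
}
```

Posting the event directly, as in #8560, corresponds to calling `resubmit` immediately from `reportFailure`; the delay only changes how many reports end up in one batch.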

@SparkQA

SparkQA commented Sep 1, 2015

Test build #41878 has finished for PR 8559 at commit 844bd33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito squito force-pushed the SPARK-10372-scheduler-integs branch from 844bd33 to 093a643 Compare October 20, 2015 16:08
@SparkQA

SparkQA commented Oct 20, 2015

Test build #43989 has finished for PR 8559 at commit 093a643.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 5, 2015

Test build #45102 has finished for PR 8559 at commit 609b61e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 10, 2015

Test build #45448 has finished for PR 8559 at commit 609b61e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 3, 2015

Test build #47164 has finished for PR 8559 at commit d61b13b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito squito force-pushed the SPARK-10372-scheduler-integs branch from d61b13b to 50d9162 Compare May 10, 2016 20:32
@SparkQA

SparkQA commented May 10, 2016

Test build #58266 has finished for PR 8559 at commit 50d9162.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 10, 2016

Test build #58275 has finished for PR 8559 at commit 46ebdb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito squito force-pushed the SPARK-10372-scheduler-integs branch from 46ebdb6 to 6e8562f Compare May 13, 2016 19:08
@SparkQA

SparkQA commented May 13, 2016

Test build #58590 has finished for PR 8559 at commit 6e8562f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BasicSchedulerIntegrationSuite extends SchedulerIntegrationSuite[SingleCoreMockBackend]
    • class MultiExecutorBackend(
    • class ExecutorTaskStatus(val host: String, val executorId: String, var freeCores: Int)

@squito squito force-pushed the SPARK-10372-scheduler-integs branch from 6e8562f to 0b6da39 Compare May 13, 2016 22:13
@SparkQA

SparkQA commented May 14, 2016

Test build #58593 has finished for PR 8559 at commit 0b6da39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito squito force-pushed the SPARK-10372-scheduler-integs branch from 0b6da39 to c091187 Compare May 17, 2016 15:58
@SparkQA

SparkQA commented May 17, 2016

Test build #58704 has finished for PR 8559 at commit c091187.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor Author

squito commented May 17, 2016

@mateiz @markhamstra @kayousterhout I've updated this to avoid changing the scheduler at all, and instead just run the mock backend in another thread. It slightly complicates writing the tests, but it avoids a (small) race in the tests and would probably have been necessary anyway once we got to more complicated tests.

In particular I want this for more tests on blacklisting (SPARK-8426), but I think there are a handful of open scheduler issues which will benefit from it.
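
For readers unfamiliar with the pattern, here is a minimal sketch of running a mock backend on its own thread while the test thread submits work and waits for completion. All names here (`FakeTask`, `ThreadedMockBackend`, `ThreadedMockBackendDemo`) are hypothetical stand-ins and not the classes added in this PR; the sketch only assumes `java.util.concurrent`.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}

// Hypothetical stand-ins; none of these are Spark classes.
final case class FakeTask(id: Int)

class ThreadedMockBackend(onFinish: FakeTask => Unit) {
  private val queue = new LinkedBlockingQueue[FakeTask]()
  @volatile private var running = true

  // The backend loop runs on its own thread, so the code under test is never
  // re-entered from the thread that submitted the task.
  private val worker = new Thread(new Runnable {
    override def run(): Unit = {
      while (running) {
        // Poll with a timeout so the loop notices stop() promptly.
        val task = queue.poll(10, TimeUnit.MILLISECONDS)
        if (task != null) onFinish(task)
      }
    }
  }, "mock-backend")
  worker.setDaemon(true)
  worker.start()

  def submit(task: FakeTask): Unit = queue.put(task)

  def stop(): Unit = {
    running = false
    worker.join(1000)
  }
}

object ThreadedMockBackendDemo {
  // Submits n tasks from the calling thread and waits until the backend
  // thread has reported all of them; returns false if that takes too long.
  def runTasks(n: Int): Boolean = {
    val done = new CountDownLatch(n)
    val backend = new ThreadedMockBackend(_ => done.countDown())
    try {
      (0 until n).foreach(i => backend.submit(FakeTask(i)))
      done.await(5, TimeUnit.SECONDS)
    } finally backend.stop()
  }
}
```

The cost mentioned above shows up in the test code: it has to synchronize with the backend thread (here via a latch) instead of asserting right after a synchronous call.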

@SparkQA

SparkQA commented May 17, 2016

Test build #58712 has finished for PR 8559 at commit 3b67b2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor Author

squito commented May 18, 2016

Sorry, one more update -- I realized that some of the abstractions were too complicated for the simple tests I had; they were a step towards some more complex tests involving multiple executors. I decided to just go ahead and add an example of those tests as well (which actually required some other minor changes in any case).

I'm sure there will be more to add to the framework at some point, but I think this is a good start for letting us add more tests.

@squito
Contributor Author

squito commented May 20, 2016

Jenkins, retest this please

@SparkQA

SparkQA commented May 21, 2016

Test build #59038 has finished for PR 8559 at commit 67acce9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  val numUpdates = in.readInt
  if (numUpdates == 0) {
-   accumUpdates = null
+   accumUpdates = Seq()
Contributor

I'm a little surprised that we were getting away with that, given the number of places that are using DirectTaskResult#accumUpdates without checking for null.

Contributor Author

yeah, likewise. I guess in all real spark jobs, the number of accum updates is > 0, which is why this didn't particularly matter except for these mocks. OTOH if this has any real danger we should just fix this separately (and be sure to include it in 2.0)
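
To illustrate why the empty `Seq()` is safer than `null` for callers that don't check, here is a small self-contained sketch. `AccumUpdatesCodec` and its `Seq[Long]` payload are hypothetical and much simpler than `DirectTaskResult`; only the count-then-elements layout is borrowed from the hunk above.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical codec, NOT DirectTaskResult: it writes a count followed by the
// elements, and on read returns an empty Seq (never null) when the count is 0.
object AccumUpdatesCodec {
  def write(updates: Seq[Long]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(updates.size)
    updates.foreach(u => out.writeLong(u))
    out.flush()
    bytes.toByteArray
  }

  def read(data: Array[Byte]): Seq[Long] = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val numUpdates = in.readInt()
    if (numUpdates == 0) Seq.empty
    else Vector.fill(numUpdates)(in.readLong())
  }
}
```

With this shape, a caller can write `AccumUpdatesCodec.read(bytes).foreach(...)` without a null check, which is the situation the comment above describes for the places that use `accumUpdates` directly.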

@squito
Contributor Author

squito commented May 23, 2016

@markhamstra thanks for the review. I went ahead and added one more commit and removed DummyExternalClusterManager, just used Mock in the old tests. Honestly I kinda prefer having ExternalClusterManagerSuite being very simple and self-contained, but also don't feel particularly strongly about it. (Easy to back out since it's just one final commit.)

@markhamstra
Contributor

@squito I don't feel really strongly about it, either. My only concern was for others adding tests needing a mock ExternalClusterManager in the future and not knowing which one to use, why they are different, whether any new mocking needs to be added to both, etc.

If someone does feel strongly about maintaining the separation, then we can put things back the way you had them.

@SparkQA

SparkQA commented May 23, 2016

Test build #59130 has finished for PR 8559 at commit 1fdf2aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor Author

squito commented May 24, 2016

@markhamstra alright, in that case do you have any objections if I merge this, one commit back? I'll throw in a comment on DummyExternalClusterManager pointing to MockExternalClusterManager, as I think most folks writing tests would rather start from there.

@markhamstra
Contributor

@squito Yeah, that's fine. I haven't gone through the new tests closely to make sure that they are doing what they say they are doing, but the changes to both non-test code and previous tests look safe.

@squito squito force-pushed the SPARK-10372-scheduler-integs branch from 1fdf2aa to 278122e Compare May 24, 2016 22:04
@SparkQA

SparkQA commented May 24, 2016

Test build #59226 has finished for PR 8559 at commit 278122e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in dfc9fc0 May 26, 2016
@rxin
Contributor

rxin commented May 26, 2016

This is pretty cool!

@rxin
Contributor

rxin commented May 26, 2016

On a related note, @squito, can you in the future leave a message indicating which branch a PR was merged into once you merge it? There have been cases that led to race conditions in merging and also mistakes in the branches that we needed to go back and audit.

@squito
Contributor Author

squito commented Jun 1, 2016

sure thing, sorry about that @rxin ! fwiw, this was just merged to master.

@rxin
Contributor

rxin commented Jun 1, 2016

np - a lot of people forget to do this. I am just bugging everybody :)

@rxin
Contributor

rxin commented Jun 1, 2016

btw I've seen a few flaky tests introduced by this one. If we see more of them we might need to disable them and increase timeouts.

@squito
Contributor Author

squito commented Jun 1, 2016

Oh, thanks for pointing out the flaky test. I think I figured out the problem and opened https://issues.apache.org/jira/browse/SPARK-15714. Weird that I never saw this when running locally -- I'd expect this race to be pretty common. I just tried, and on my laptop it only shows up once in 100 runs.

@skonto
Contributor

skonto commented Jun 3, 2016

About the flaky test (checked locally on master through my IDE; it's easily reproducible):

Map(8 -> 42, 7 -> 42) was not empty
ScalaTestFailureLocation: org.apache.spark.scheduler.BlacklistIntegrationSuite$$anonfun$1 at (BlacklistIntegrationSuite.scala:51)
org.scalatest.exceptions.TestFailedException: Map(8 -> 42, 7 -> 42) was not empty
at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
at org.apache.spark.scheduler.BlacklistIntegrationSuite$$anonfun$1.apply$mcV$sp(BlacklistIntegrationSuite.scala:51)
at org.apache.spark.scheduler.SchedulerIntegrationSuite$$anonfun$testScheduler$1.apply$mcV$sp(SchedulerIntegrationSuite.scala:88)
at org.apache.spark.scheduler.SchedulerIntegrationSuite$$anonfun$testScheduler$1.apply(SchedulerIntegrationSuite.scala:84)
at org.apache.spark.scheduler.SchedulerIntegrationSuite$$anonfun$testScheduler$1.apply(SchedulerIntegrationSuite.scala:84)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at

....
Also check here:
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59928/

@squito
Contributor Author

squito commented Jun 3, 2016

@skonto I just merged #13454, please let me know if that does not address the issue.

@skonto
Contributor

skonto commented Jun 3, 2016

Locally I now get:

Futures timed out after [1 second]
java.util.concurrent.TimeoutException: Futures timed out after [1 second]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.ready(package.scala:169)
at org.apache.spark.scheduler.BlacklistIntegrationSuite$$anonfun$3$$anonfun$apply$mcV$sp$6.apply(BlacklistIntegrationSuite.scala:90)
at org.apache.spark.scheduler.BlacklistIntegrationSuite$$anonfun$3$$anonfun$apply$mcV$sp$6.apply(BlacklistIntegrationSuite.scala:87)
at org.apache.spark.scheduler.SchedulerIntegrationSuite.withBackend(SchedulerIntegrationSuite.scala:224)

@squito
Contributor Author

squito commented Jun 3, 2016

@skonto since you seem to be able to reproduce this regularly though I can't -- can you try just increasing the timeouts? Honestly I'm surprised you're hitting them, but if we just need to bump up the timeouts a little more, we can do that. If there is something else wrong, e.g. another race, then let's turn these tests off for now until I can nail it down.
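
The `Futures timed out after [1 second]` error in the trace above comes from `Await.ready` being given a fixed duration. A minimal sketch of making that wait explicit and tunable follows; `AwaitWithTimeout` and `completesWithin` are hypothetical names and not part of the test suite, while `Await.ready` and `TimeoutException` are the standard Scala and Java APIs.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

// Hypothetical helper, not part of SchedulerIntegrationSuite.
object AwaitWithTimeout {
  // Returns true if the future completed (successfully or not) within
  // `timeout`, and false if Await.ready gave up with a TimeoutException.
  def completesWithin[T](f: Future[T], timeout: FiniteDuration): Boolean =
    try {
      Await.ready(f, timeout)
      true
    } catch {
      case _: TimeoutException => false
    }
}
```

Bumping the timeout then means changing one argument. It only helps if the job is merely slow; a future that never completes because of a race will time out at any duration.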

@skonto
Contributor

skonto commented Jun 6, 2016

@squito not really, check here:
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60037/
Tests should be stable, and this is not just a local problem, btw. I don't really know why the test takes so long, but it needs investigation. I suggest disabling that test for now or trying it in a different environment; I may check it locally as well when time permits. For now I would like to have a successful build for the PR.

@squito
Contributor Author

squito commented Jun 6, 2016

@skonto agreed, I have opened https://issues.apache.org/jira/browse/SPARK-15783 + PR #13528 to turn them off for now.

I was at least able to reproduce the failures with 5k runs on my laptop (it also seems worse when I run all the different tests together vs. just running one test repeatedly), so I will try to nail down the cause before adding them back.

@squito
Contributor Author

squito commented Jun 6, 2016

@skonto I just merged #13528 to turn the tests off for now. Sorry again about the issues, and I'll work on figuring out the cause of the flakiness.
