[SPARK-19560] Improve DAGScheduler tests. #16892

kayousterhout · 2017-02-11T06:16:00Z

This commit improves the tests that check the case when a
ShuffleMapTask completes successfully on an executor that has
failed. This commit improves the commenting around the existing
test for this, and adds some additional checks to make it more
clear what went wrong if the tests fail (the fact that these
tests are hard to understand came up in the context of @markhamstra's
proposed fix for #16620).

This commit also removes a test that I realized tested exactly
the same functionality.

@markhamstra, I verified that the new version of the test still fails (and
in a more helpful way) for your proposed change for #16620.

This commit improves the tests that check the case when a ShuffleMapTask completes successfully on an executor that has failed. This commit improves the commenting around the existing test for this, and adds some additional checks to make it more clear what went wrong if the tests fail (the fact that these tests are hard to understand came up in the context of apache#16620). This commit also removes a test that I realized tested exactly the same functionality.

SparkQA · 2017-02-11T06:18:28Z

Test build #72732 has started for PR 16892 at commit 06b26d3.

kayousterhout · 2017-02-11T06:39:31Z

cc @mateiz, whose test I deleted / rolled into the existing one

kayousterhout · 2017-02-12T00:09:01Z

Jenkins retest this please

SparkQA · 2017-02-12T02:46:27Z

Test build #72744 has finished for PR 16892 at commit 06b26d3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

squito

The cleanup of the existing test looks good, but I don't think we should delete the other one. That is testing directly submitting mapstage jobs. While a lot of the logic is the same, I think its worth its own test (even earlier versions of #16620 had the behavior wrong for these jobs).

kayousterhout · 2017-02-15T19:45:57Z

Ok I added back the other test but improved the commenting there.

SparkQA · 2017-02-15T21:52:23Z

Test build #72948 has finished for PR 16892 at commit f8407f2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kayousterhout · 2017-02-15T22:04:38Z

Jenkins retest this please (filed https://issues.apache.org/jira/browse/SPARK-19613 for the flaky test)

SparkQA · 2017-02-16T00:48:34Z

Test build #72958 has finished for PR 16892 at commit f8407f2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

squito

lgtm

I left some suggestions for more things that could be done, but I don't think that is necessary for getting this in, as what you have already is definitely an improvment.

squito · 2017-02-21T18:32:38Z

core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala

+    // Make sure that the stage that was re-submitted was the ShuffleMapStage (not the reduce
+    // stage, which shouldn't be run until all of the tasks in the ShuffleMapStage complete on
+    // alive executors).
+    assert(taskSets(1).tasks(0).isInstanceOf[ShuffleMapTask])


do you think its worth adding

assert(taskSets(1).tasks.size === 1)

here, to make sure that only the one task is resubmitted, not both? If it weren't true, the test would fail later on anyway, but it might be helpful to get a more meaningful earlier error msg. Not necessary, up to you on whether its worth adding.

Good idea done

squito · 2017-02-21T18:35:26Z

core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala

+   * Most of the functionality in this test is tested in "run trivial shuffle with out-of-band
+   * executor failure and retry".  However, that test uses ShuffleMapStages that are followed by
+   * a ResultStage, whereas in this test, the ShuffleMapStage is tested in isolation, without a
+   * ResultStage after it.


I should have looked closer at this test earlier ... I was hoping this was testing a multi-stage mapjob. That would really be a better test. ideally we'd even have three stages, with a failure happening in the second stage, and the last stage.

In any case, your changes still look good, no need to have to do those other things now.

The original point of this test was to test a Map-only stage, so then it can't have a stage that follows it, right? I thought your earlier comment ("That is testing directly submitting mapstage jobs. While a lot of the logic is the same, I think its worth its own test (even earlier versions of #16620 had the behavior wrong for these jobs)." was saying that it was important / useful to have this map-only test. Let me know if that comment was based on the understanding that this tested multi-stage jobs and you think I should just remove this map-only test.

I do agree that it would be useful to add another test that tests a job with more stages, which seems like it could reveal more bugs, but I'll hold off on doing that in this PR.

the final map-stage can't have anything that follows it, but the job overall can still have multiple stages, and the failure can occur during the processing of those earlier map-stages, or the final one.

In any case, I agree you don't need to expand that test in this PR. and even though this test doesn't do as much as I was hoping, I do still think it adds value, and is worth leaving in, even though its very similar to the other test.

I see that makes sense.

SparkQA · 2017-02-24T00:00:12Z

Test build #73363 has finished for PR 16892 at commit 1407b48.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kayousterhout · 2017-02-24T19:43:34Z

I merged this into master -- thanks for the review @squito.

This commit improves the tests that check the case when a ShuffleMapTask completes successfully on an executor that has failed. This commit improves the commenting around the existing test for this, and adds some additional checks to make it more clear what went wrong if the tests fail (the fact that these tests are hard to understand came up in the context of markhamstra's proposed fix for apache#16620). This commit also removes a test that I realized tested exactly the same functionality. markhamstra, I verified that the new version of the test still fails (and in a more helpful way) for your proposed change for apache#16620. Author: Kay Ousterhout <[email protected]> Closes apache#16892 from kayousterhout/SPARK-19560.

kayousterhout mentioned this pull request Feb 11, 2017

[SPARK-19263] DAGScheduler should avoid sending conflicting task set. #16620

Closed

squito suggested changes Feb 14, 2017

View reviewed changes

kayousterhout added 2 commits February 15, 2017 11:16

Added back removed test

ed169aa

Improved test commenting

f8407f2

squito approved these changes Feb 21, 2017

View reviewed changes

Added another check as per Imran's comment

1407b48

asfgit closed this in 5cbd3b5 Feb 24, 2017

kayousterhout deleted the SPARK-19560 branch April 11, 2017 22:02

[SPARK-19560] Improve DAGScheduler tests. #16892

[SPARK-19560] Improve DAGScheduler tests. #16892

Uh oh!

Conversation

kayousterhout commented Feb 11, 2017

Uh oh!

SparkQA commented Feb 11, 2017

Uh oh!

kayousterhout commented Feb 11, 2017

Uh oh!

kayousterhout commented Feb 12, 2017

Uh oh!

SparkQA commented Feb 12, 2017

Uh oh!

squito left a comment

Choose a reason for hiding this comment

Uh oh!

kayousterhout commented Feb 15, 2017

Uh oh!

SparkQA commented Feb 15, 2017

Uh oh!

kayousterhout commented Feb 15, 2017

Uh oh!

SparkQA commented Feb 16, 2017

Uh oh!

squito left a comment

Choose a reason for hiding this comment

Uh oh!

squito Feb 21, 2017

Choose a reason for hiding this comment

Uh oh!

kayousterhout Feb 23, 2017

Choose a reason for hiding this comment

Uh oh!

squito Feb 21, 2017

Choose a reason for hiding this comment

Uh oh!

kayousterhout Feb 23, 2017

Choose a reason for hiding this comment

Uh oh!

squito Feb 24, 2017

Choose a reason for hiding this comment

Uh oh!

kayousterhout Feb 24, 2017

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Feb 24, 2017

Uh oh!

kayousterhout commented Feb 24, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants