Skip to content

Conversation

@kayousterhout
Copy link
Contributor

This commit improves the tests that check the case when a
ShuffleMapTask completes successfully on an executor that has
failed. This commit improves the commenting around the existing
test for this, and adds some additional checks to make it more
clear what went wrong if the tests fail (the fact that these
tests are hard to understand came up in the context of @markhamstra's
proposed fix for #16620).

This commit also removes a test that I realized tested exactly
the same functionality.

@markhamstra, I verified that the new version of the test still fails (and
in a more helpful way) for your proposed change for #16620.

This commit improves the tests that check the case when a
ShuffleMapTask completes successfully on an executor that has
failed.  This commit improves the commenting around the existing
test for this, and adds some additional checks to make it more
clear what went wrong if the tests fail (the fact that these
tests are hard to understand came up in the context of apache#16620).

This commit also removes a test that I realized tested exactly
the same functionality.
@SparkQA
Copy link

SparkQA commented Feb 11, 2017

Test build #72732 has started for PR 16892 at commit 06b26d3.

@kayousterhout
Copy link
Contributor Author

cc @mateiz, whose test I deleted / rolled into the existing one

@kayousterhout
Copy link
Contributor Author

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Feb 12, 2017

Test build #72744 has finished for PR 16892 at commit 06b26d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@squito squito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup of the existing test looks good, but I don't think we should delete the other one. That is testing directly submitting mapstage jobs. While a lot of the logic is the same, I think its worth its own test (even earlier versions of #16620 had the behavior wrong for these jobs).

@kayousterhout
Copy link
Contributor Author

Ok I added back the other test but improved the commenting there.

@SparkQA
Copy link

SparkQA commented Feb 15, 2017

Test build #72948 has finished for PR 16892 at commit f8407f2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Copy link
Contributor Author

Jenkins retest this please (filed https://issues.apache.org/jira/browse/SPARK-19613 for the flaky test)

@SparkQA
Copy link

SparkQA commented Feb 16, 2017

Test build #72958 has finished for PR 16892 at commit f8407f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@squito squito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

I left some suggestions for more things that could be done, but I don't think that is necessary for getting this in, as what you have already is definitely an improvment.

// Make sure that the stage that was re-submitted was the ShuffleMapStage (not the reduce
// stage, which shouldn't be run until all of the tasks in the ShuffleMapStage complete on
// alive executors).
assert(taskSets(1).tasks(0).isInstanceOf[ShuffleMapTask])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think its worth adding

assert(taskSets(1).tasks.size === 1)

here, to make sure that only the one task is resubmitted, not both? If it weren't true, the test would fail later on anyway, but it might be helpful to get a more meaningful earlier error msg. Not necessary, up to you on whether its worth adding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea done

* Most of the functionality in this test is tested in "run trivial shuffle with out-of-band
* executor failure and retry". However, that test uses ShuffleMapStages that are followed by
* a ResultStage, whereas in this test, the ShuffleMapStage is tested in isolation, without a
* ResultStage after it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have looked closer at this test earlier ... I was hoping this was testing a multi-stage mapjob. That would really be a better test. ideally we'd even have three stages, with a failure happening in the second stage, and the last stage.

In any case, your changes still look good, no need to have to do those other things now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original point of this test was to test a Map-only stage, so then it can't have a stage that follows it, right? I thought your earlier comment ("That is testing directly submitting mapstage jobs. While a lot of the logic is the same, I think its worth its own test (even earlier versions of #16620 had the behavior wrong for these jobs)." was saying that it was important / useful to have this map-only test. Let me know if that comment was based on the understanding that this tested multi-stage jobs and you think I should just remove this map-only test.

I do agree that it would be useful to add another test that tests a job with more stages, which seems like it could reveal more bugs, but I'll hold off on doing that in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the final map-stage can't have anything that follows it, but the job overall can still have multiple stages, and the failure can occur during the processing of those earlier map-stages, or the final one.

In any case, I agree you don't need to expand that test in this PR. and even though this test doesn't do as much as I was hoping, I do still think it adds value, and is worth leaving in, even though its very similar to the other test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that makes sense.

@SparkQA
Copy link

SparkQA commented Feb 24, 2017

Test build #73363 has finished for PR 16892 at commit 1407b48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Copy link
Contributor Author

I merged this into master -- thanks for the review @squito.

@asfgit asfgit closed this in 5cbd3b5 Feb 24, 2017
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
This commit improves the tests that check the case when a
ShuffleMapTask completes successfully on an executor that has
failed.  This commit improves the commenting around the existing
test for this, and adds some additional checks to make it more
clear what went wrong if the tests fail (the fact that these
tests are hard to understand came up in the context of markhamstra's
proposed fix for apache#16620).

This commit also removes a test that I realized tested exactly
the same functionality.

markhamstra, I verified that the new version of the test still fails (and
in a more helpful way) for your proposed change for apache#16620.

Author: Kay Ousterhout <[email protected]>

Closes apache#16892 from kayousterhout/SPARK-19560.
@kayousterhout kayousterhout deleted the SPARK-19560 branch April 11, 2017 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants