[SPARK-19560] Improve DAGScheduler tests. #16892
This commit improves the tests that check the case when a ShuffleMapTask completes successfully on an executor that has failed. This commit improves the commenting around the existing test for this, and adds some additional checks to make it more clear what went wrong if the tests fail (the fact that these tests are hard to understand came up in the context of apache#16620). This commit also removes a test that I realized tested exactly the same functionality.
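For readers unfamiliar with the scenario, the following is a toy model of the behavior these tests check: a success report that arrives from an executor already marked as failed should not count as available map output, so that partition still has to be re-run on a live executor. This is only a sketch. `ToyMapStage`, `Completion`, and the executor names are hypothetical and are not part of Spark; the real DAGScheduler's bookkeeping is considerably more involved.

```scala
import scala.collection.mutable

// Toy model only: none of these names come from Spark's DAGScheduler.
object LateCompletionSketch {
  // A success report for one map partition, tagged with the reporting executor.
  final case class Completion(partition: Int, executor: String)

  final class ToyMapStage(numPartitions: Int) {
    private val outputs = mutable.Map.empty[Int, String] // partition -> executor
    private val lostExecutors = mutable.Set.empty[String]

    // Mark an executor as failed and forget any map output it was holding.
    def executorLost(executor: String): Unit = {
      lostExecutors += executor
      val gone = outputs.collect { case (p, e) if e == executor => p }.toList
      gone.foreach(p => outputs.remove(p))
    }

    // Returns true only if the report came from an executor still considered alive.
    def taskSucceeded(c: Completion): Boolean =
      if (lostExecutors.contains(c.executor)) {
        false
      } else {
        outputs(c.partition) = c.executor
        true
      }

    // Partitions that still need to run (or re-run) on a live executor.
    def missingPartitions: Seq[Int] =
      (0 until numPartitions).filterNot(p => outputs.contains(p))

    def readyForNextStage: Boolean = missingPartitions.isEmpty
  }

  def main(args: Array[String]): Unit = {
    val stage = new ToyMapStage(numPartitions = 2)
    stage.executorLost("exec-A")

    // The report from the failed executor is ignored; the other one is accepted.
    val accepted0 = stage.taskSucceeded(Completion(0, "exec-A"))
    val accepted1 = stage.taskSucceeded(Completion(1, "exec-B"))
    assert(!accepted0 && accepted1)

    // Only partition 0 is resubmitted, and the next stage must wait for it.
    assert(stage.missingPartitions == Seq(0))
    assert(!stage.readyForNextStage)

    stage.taskSucceeded(Completion(0, "exec-B"))
    assert(stage.readyForNextStage)
    println("ok")
  }
}
```

The model deliberately treats a lost executor as lost forever, which keeps the sketch short but is a simplification.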
Test build #72732 has started for PR 16892 at commit
cc @mateiz, whose test I deleted / rolled into the existing one
Jenkins retest this please
Test build #72744 has finished for PR 16892 at commit
squito
left a comment
The cleanup of the existing test looks good, but I don't think we should delete the other one. That test covers directly submitting map-stage jobs. While a lot of the logic is the same, I think it's worth its own test (even earlier versions of #16620 had the behavior wrong for these jobs).
Ok I added back the other test but improved the commenting there.
Test build #72948 has finished for PR 16892 at commit
Jenkins retest this please (filed https://issues.apache.org/jira/browse/SPARK-19613 for the flaky test)
Test build #72958 has finished for PR 16892 at commit
squito
left a comment
lgtm
I left some suggestions for more things that could be done, but I don't think they are necessary for getting this in, as what you have already is definitely an improvement.
// Make sure that the stage that was re-submitted was the ShuffleMapStage (not the reduce
// stage, which shouldn't be run until all of the tasks in the ShuffleMapStage complete on
// alive executors).
assert(taskSets(1).tasks(0).isInstanceOf[ShuffleMapTask])
Do you think it's worth adding
assert(taskSets(1).tasks.size === 1)
here, to make sure that only the one task is resubmitted, not both? If it weren't true, the test would fail later on anyway, but it might be helpful to get a more meaningful error message earlier. Not necessary; it's up to you whether it's worth adding.
Good idea, done.
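As a rough illustration of why the earlier check helps, the sketch below asserts the size of a resubmitted task set up front, with a message naming the violated expectation, so a failure points at the resubmission rather than at some later step. `ToyTaskSet` and `checkResubmission` are hypothetical stand-ins, not Spark's `TaskSet` or the suite's helpers, and plain `Predef.assert` is used instead of ScalaTest's `===` only to keep the example self-contained.

```scala
// Hypothetical stand-ins; not Spark's TaskSet or the test suite's helpers.
object EarlyAssertSketch {
  final case class ToyTaskSet(stageId: Int, taskKinds: Seq[String])

  def checkResubmission(resubmitted: ToyTaskSet): Unit = {
    // Fail fast if more than the one expected task was resubmitted.
    assert(resubmitted.taskKinds.size == 1,
      s"expected exactly 1 resubmitted task, got ${resubmitted.taskKinds.size}")
    // Then check that the resubmitted task is a map task, not a reduce task.
    assert(resubmitted.taskKinds.head == "ShuffleMapTask",
      s"expected a ShuffleMapTask, got ${resubmitted.taskKinds.head}")
  }

  def main(args: Array[String]): Unit = {
    // Passes: exactly one map task was resubmitted.
    checkResubmission(ToyTaskSet(0, Seq("ShuffleMapTask")))

    // Fails on the size check, before any later step can obscure the cause.
    val failure: Option[String] =
      try {
        checkResubmission(ToyTaskSet(0, Seq("ShuffleMapTask", "ShuffleMapTask")))
        None
      } catch {
        case e: AssertionError => Some(e.getMessage)
      }
    println(failure)
  }
}
```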
 * Most of the functionality in this test is tested in "run trivial shuffle with out-of-band
 * executor failure and retry". However, that test uses ShuffleMapStages that are followed by
 * a ResultStage, whereas in this test, the ShuffleMapStage is tested in isolation, without a
 * ResultStage after it.
I should have looked closer at this test earlier... I was hoping this was testing a multi-stage map job. That would really be a better test. Ideally we'd even have three stages, with a failure happening in the second stage and in the last stage.
In any case, your changes still look good; there's no need to do those other things now.
The original point of this test was to test a map-only stage, so it can't have a stage that follows it, right? I thought your earlier comment ("That is testing directly submitting mapstage jobs. While a lot of the logic is the same, I think its worth its own test (even earlier versions of #16620 had the behavior wrong for these jobs).") was saying that it was important / useful to have this map-only test. Let me know if that comment was based on the understanding that this tested multi-stage jobs and you think I should just remove this map-only test.
I do agree that it would be useful to add another test that tests a job with more stages, which seems like it could reveal more bugs, but I'll hold off on doing that in this PR.
The final map stage can't have anything that follows it, but the job overall can still have multiple stages, and the failure can occur during the processing of those earlier map stages or the final one.
In any case, I agree you don't need to expand that test in this PR. Even though this test doesn't do as much as I was hoping, I still think it adds value and is worth leaving in, even though it's very similar to the other test.
I see, that makes sense.
Test build #73363 has finished for PR 16892 at commit
I merged this into master -- thanks for the review @squito.
This commit improves the tests that check the case when a ShuffleMapTask completes successfully on an executor that has failed. This commit improves the commenting around the existing test for this, and adds some additional checks to make it more clear what went wrong if the tests fail (the fact that these tests are hard to understand came up in the context of markhamstra's proposed fix for apache#16620). This commit also removes a test that I realized tested exactly the same functionality.
markhamstra, I verified that the new version of the test still fails (and in a more helpful way) for your proposed change for apache#16620.
Author: Kay Ousterhout <[email protected]>
Closes apache#16892 from kayousterhout/SPARK-19560.