[SPARK-20115] [CORE] Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable #17445
Conversation
…l shuffle service is unavailable on that host cr https://cr.amazon.com/r/6822886/
Jenkins this is OK to test
Have you seen #17088? I just glanced at this quickly, but I think this is a duplicate of that (SPARK-19753).
@kayousterhout Thanks for your response, and for that link. It does seem like #17088 addresses the same issue as this PR. However, I would like you all to review this PR as well, because I think it more clearly organizes the code between the handling of internal and external shuffle failures. It also removes a lot of the code duplication that is part of the other PR. Further, it adds an epoch check for the 'host'.
Jenkins test this please.
Jenkins this is ok to test
Test build #75513 has finished for PR 17445 at commit
@kayousterhout @mridulm @rxin @lins05 Can you take a look at this PR?
@kayousterhout @mridulm @rxin @lins05 @markhamstra @tgravescs @squito Can you take a look at this?
There is a large discussion about how to handle fetch failures going on in https://issues.apache.org/jira/browse/SPARK-20178. The fact that you got a fetch failure does not mean that all blocks are invalid or that the external shuffle service is totally down; it could very well be an intermittent problem. There was also a PR to make the number of stage attempts configurable, so you could increase that. If a lot of people are seeing this issue, the question is whether we need to do something shorter term to handle it while we discuss SPARK-20178. Certainly, if we are seeing more actual job failures due to it, it would be better to invalidate all the output; the job may run longer, but at least it doesn't fail.
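As a user-side stop-gap, the configurable stage-attempt limit mentioned above can be raised so a job tolerates more consecutive fetch failures. A minimal sketch follows; the config key `spark.stage.maxConsecutiveAttempts` is an assumption based on later Spark versions (default 4), and raising it only buys extra retries rather than recovering the lost blocks:

```scala
// Sketch only: spark.stage.maxConsecutiveAttempts is assumed from later
// Spark versions and is not part of this PR.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("fetch-failure-tolerant-job")
  // Allow more consecutive fetch-failure retries per stage before aborting.
  .set("spark.stage.maxConsecutiveAttempts", "8")

val spark = SparkSession.builder().config(conf).getOrCreate()
```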
Now that #18150 has been merged, maybe we can close this?
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances as in apache#18017. I believe the author in apache#14807 removed his account.

Closes apache#7075
Closes apache#8927
Closes apache#9202
Closes apache#9366
Closes apache#10861
Closes apache#11420
Closes apache#12356
Closes apache#13028
Closes apache#13506
Closes apache#14191
Closes apache#14198
Closes apache#14330
Closes apache#14807
Closes apache#15839
Closes apache#16225
Closes apache#16685
Closes apache#16692
Closes apache#16995
Closes apache#17181
Closes apache#17211
Closes apache#17235
Closes apache#17237
Closes apache#17248
Closes apache#17341
Closes apache#17708
Closes apache#17716
Closes apache#17721
Closes apache#17937

Added:
Closes apache#14739
Closes apache#17139
Closes apache#17445
Closes apache#18042
Closes apache#18359

Added:
Closes apache#16450
Closes apache#16525
Closes apache#17738

Added:
Closes apache#16458
Closes apache#16508
Closes apache#17714

Added:
Closes apache#17830
Closes apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18417 from HyukjinKwon/close-stale-pr.
What changes were proposed in this pull request?
Spark's DAGScheduler currently does not recompute all the lost shuffle blocks on a host when a FetchFailed exception occurs while fetching shuffle blocks from another executor with the external shuffle service enabled. Instead, it only recomputes the lost shuffle blocks produced by the executor for which the FetchFailed exception occurred. This works fine for the internal shuffle scenario, where executors serve their own shuffle blocks and hence only the shuffle blocks of that executor should be considered lost. However, when the external shuffle service is being used, a FetchFailed exception means that the external shuffle service running on that host has become unavailable, which in turn is sufficient to assume that all the shuffle blocks managed by the shuffle service on that host are lost. Therefore, recomputing only the shuffle blocks associated with the particular executor for which the FetchFailed exception occurred is not sufficient; we need to recompute all the shuffle blocks managed by that service, because there could be multiple executors running on that host.
Because not all the shuffle blocks (for all the executors on the host) are recomputed, subsequent attempts of the reduce stage fail as well: the newly scheduled tasks keep trying to fetch from the old location of the shuffle blocks that were not recomputed and keep throwing further FetchFailed exceptions. This ultimately causes the job to fail after the reduce stage has been retried 4 times.
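To make the distinction concrete, here is a minimal, self-contained sketch of the intended behavior. It is not code from this PR or from Spark itself; the type and method names (`ShuffleOutputs`, `removeOutputsOnHost`, `removeOutputsOnExecutor`) are invented stand-ins for DAGScheduler/MapOutputTracker internals. With the external shuffle service enabled, a fetch failure invalidates every map output registered on the failing host, not just the failing executor's:

```scala
// Illustrative sketch only -- not code from this PR or from Spark itself.
object FetchFailureSketch {

  case class BlockManagerLocation(executorId: String, host: String)

  // Tracks where each registered map output currently lives.
  class ShuffleOutputs {
    private var locations: Set[BlockManagerLocation] = Set.empty

    def register(loc: BlockManagerLocation): Unit = locations += loc

    // Internal shuffle: only the failed executor's own blocks are lost.
    def removeOutputsOnExecutor(executorId: String): Unit =
      locations = locations.filterNot(_.executorId == executorId)

    // External shuffle service: the service on that host served blocks for
    // every executor on the host, so all of them must be recomputed.
    def removeOutputsOnHost(host: String): Unit =
      locations = locations.filterNot(_.host == host)

    def remaining: Set[BlockManagerLocation] = locations
  }

  def handleFetchFailure(
      failed: BlockManagerLocation,
      externalShuffleServiceEnabled: Boolean,
      outputs: ShuffleOutputs): Unit = {
    if (externalShuffleServiceEnabled) {
      outputs.removeOutputsOnHost(failed.host)
    } else {
      outputs.removeOutputsOnExecutor(failed.executorId)
    }
    // The scheduler would then resubmit the map stage to regenerate the
    // missing outputs before retrying the reduce stage.
  }

  def main(args: Array[String]): Unit = {
    val outputs = new ShuffleOutputs
    outputs.register(BlockManagerLocation("exec-1", "host-a"))
    outputs.register(BlockManagerLocation("exec-2", "host-a"))
    outputs.register(BlockManagerLocation("exec-3", "host-b"))

    // A fetch failure against exec-1 with the external shuffle service on:
    // every output on host-a is treated as lost, not just exec-1's.
    handleFetchFailure(BlockManagerLocation("exec-1", "host-a"),
      externalShuffleServiceEnabled = true, outputs)
    println(outputs.remaining) // Set(BlockManagerLocation(exec-3,host-b))
  }
}
```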
The following changes are proposed to address the above issue:
How was this patch tested?
@kayousterhout @mridulm @rxin