[SPARK-23243][Core] Fix RDD.repartition() data correctness issue #21698
Conversation
Test build #92526 has finished for PR 21698 at commit
retest this please
Test build #92527 has finished for PR 21698 at commit
    new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
    new ShuffledRDD[Int, T, T](
      mapPartitionsWithIndex(distributePartition, recomputeOnFailure),
to avoid changing the existing mapPartitionsWithIndex, we can create MapPartitionsRDD directly here.
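A rough sketch of what that suggestion might look like (illustrative only, inside `RDD.coalesce` where `T`, `numPartitions` and `distributePartition` are in scope; the actual change may differ):

```scala
// Sketch: build the MapPartitionsRDD for the round-robin key assignment
// directly, instead of threading a new flag through RDD.mapPartitionsWithIndex.
new ShuffledRDD[Int, T, T](
  new MapPartitionsRDD[(Int, T), T](
    this,
    (_: TaskContext, index: Int, iter: Iterator[T]) => distributePartition(index, iter)),
  new HashPartitioner(numPartitions))
```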
    // include a shuffle step so that our upstream tasks are still distributed
    val recomputeOnFailure =
      conf.getBoolean("spark.shuffle.recomputeAllPartitionsOnRepartitionFailure", true)
put it in object config.
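For reference, declaring the flag in `org.apache.spark.internal.config` would look roughly like this (a sketch; the entry name and doc text follow the hard-coded key above and are not the final implementation):

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Sketch: move the hard-coded key above into the shared config object so it is
// documented and type-checked in one place.
private[spark] val SHUFFLE_RECOMPUTE_ALL_PARTITIONS_ON_REPARTITION_FAILURE =
  ConfigBuilder("spark.shuffle.recomputeAllPartitionsOnRepartitionFailure")
    .doc("When true, recompute all partitions of a repartition's map stage on fetch " +
      "failure, instead of only the missing ones, to avoid correctness issues.")
    .booleanConf
    .createWithDefault(true)
```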
without sorting, it doesn't make sense to have this config: disabling it means users will get wrong results.
    shuffleStatus.removeOutputsByFilter(x => true)
    incrementEpoch()
    case None =>
      throw new SparkException("unregisterMapOutput called for nonexistent shuffle ID")
unregisterMapOutput -> unregisterAllMapOutput.
Thanks @cloud-fan @viirya, comments addressed :)
Test build #92616 has finished for PR 21698 at commit
    // Mark the map whose fetch failed as broken in the map stage
    if (mapId != -1) {
    ...
    if (mapStage.rdd.recomputeAllPartitionsOnFailure()) {
hmmm, what if we have a map after repartition? then the root RDD will return false on recomputeAllPartitionsOnFailure
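To illustrate the concern with a hypothetical job (assuming a SparkContext `sc`; names are illustrative):

```scala
// Hypothetical job: once other transformations are chained after repartition(),
// the RDD the scheduler inspects for the flag is no longer the repartition RDD
// itself, so a check on a single RDD can miss the order-sensitive round-robin
// step sitting upstream.
val out = sc.parallelize(1 to 1000000)
  .repartition(8)            // order-sensitive round-robin distribution
  .map(x => (x % 100, x))    // the flag would have to propagate through this RDD
  .reduceByKey(_ + _)        // a fetch failure here triggers the rollback decision
out.count()
```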
     * multiple tasks from the same stage attempt fail (SPARK-5945).
     */
    val fetchFailedAttemptIds = new HashSet[Int]
    val failedAttemptIds = new HashSet[Int]
why rename it? we only increase it on fetch failure, don't we?
I did not go over the PR itself in detail, but the proposal sounds very expensive - particularly given the cascading costs involved. Also, I am not sure why we are special-casing only coalesce/repartition here: any closure which depends on the ordering of tuples is bound to fail - for example, the RDD.zip* variants, sampling in ML, etc. will suffer from the same issue. Either we fix shuffle itself to become deterministic (which I am not sure we can do efficiently), or we could simply document this issue with coalesce/other relevant APIs - so that users do a sort when applicable: when they deem the additional cost is required to be borne.
Thank you for your comments @mridulm ! Since you are also deeply involved in the related discussion on the correctness issue caused by non-deterministic input for shuffle, you may also agree that there is actually no easy way to both guarantee correctness and avoid a serious performance drop-off. I have to insist that correctness always goes beyond performance concerns, and that we should not expect users to always remember they may hit a known correctness bug under certain use patterns.

As for the proposed solution, there are actually two ways to follow: either you insert a local sort before a shuffle repartition (that's how we fixed DataFrame.repartition()), or you always retry the whole stage containing the repartition on FetchFailure. The problem with the local-sort solution is that it can't fix all the problems for RDDs (the data type of an RDD may not be sortable, and it's hard to construct an ordering for a generic type), and it can make the time consumption of repartition() increase by 3x ~ 5x. By applying the approach proposed in this PR, the performance stays the same when no FetchFailure happens, and it works for DataFrames as well as for RDDs.

I have to admit that if you have a big query running on a huge cluster, and the tasks can easily hit FetchFailure issues, then you may see the job take more time to finish (or even fail due to reaching the max consecutive stage failure limit). But again, your big query may be producing wrong results without a patch, and I have to say that is even more unacceptable :( . As for the cascading cost, you are right, it makes things worse, and I don't have good advice for that issue.
@jiangxb1987 Any closure sensitive to iteration order [1] is affected by this - under the same set of circumstances. The proposed solution will cause cascading failures for non-trivial applications (chains of shuffles) - and also introduce high cost - and can unfortunately cause application failures and unpredictable SLAs. Having said that, if this is an attempt to explore solutions, I am all for it! I assumed this was a proposal to get it merged.

[1] I gave the example of zip* and sampling, but really - any user-defined closure is affected; and we cannot special-case for all of them.
IMO an RDD, as a distributed data set, should not guarantee any record order unless you sort it. So user functions and Spark internal functions should not expect a specific record order. However, the round-robin partitioner (followed by a shuffle) violates this. If the record order changes during a retry, we may get wrong results. That's why we should fix `repartition`.

I agree with @mridulm that this may introduce a big perf penalty. But when a repartition task fails, we should pay the cost to get the correct result, instead of producing wrong results and asking users to deal with it themselves. I feel this is a better solution than the sort one. We should only pay the cost when we really need to, i.e. when the repartition task fails and it's not the final stage.
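For context, the round-robin distribution in `RDD.coalesce(shuffle = true)` looks roughly like the following (paraphrased from the 2.x code, with `T` and `numPartitions` in scope; details may differ), which shows why the assigned partition depends purely on the order in which records arrive:

```scala
import scala.util.Random
import scala.util.hashing

// Paraphrased sketch of the order-sensitive round-robin key assignment: the
// target partition is derived from each record's *position* in the iterator,
// so a recomputed partition that yields rows in a different order sends them
// to different reducers.
val distributePartition = (index: Int, items: Iterator[T]) => {
  var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
  items.map { t =>
    position = position + 1
    (position, t)   // key depends on arrival order, not on t itself
  }
}: Iterator[(Int, T)]
```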
@cloud-fan The difference would be between a (user) defined record order (global sort or local sort) and expectation of repeatable record order on recomputation.
This is not limited to repartition: any closure which depends on input order has the same effect - repartition/coalesce is one instance of this issue - I gave a few examples from Spark itself, and I am sure there are other examples from Spark and user code. It is possible this issue was initially identified via repartition - but modeling the solution only for one manifestation of the issue ignores all the others and leaves them unfixed.
@cloud-fan We should not look at a particular stage in isolation, but rather at what happens when there are failures in the middle of a job with multiple shuffle stages - and zip is one of the internal stages. If the relative ordering of rdd1 or rdd2 changes, the computed key would change - and we end up with data loss if some of the tasks in the save stage have already completed.
The above example may create some different output when retrying a subset of all the tasks. But I may not call it a data loss or data correctness issue. Let's imagine you run the query twice, each with a different ordering of rdd1/rdd2.

Now I tend to believe there will not be data loss or a data correctness issue, as long as you don't spread input data across partitions in a round-robin way (or, in a way that is not related to the data itself), because on task retry you are guaranteed that all input data is covered (each row gets recomputed exactly once, though maybe in a different order).
@jiangxb1987 A different number of output rows is due to data loss - it is not another valid run.
@mridulm you provided a good example to show the indeterminacy of `zip`.

Basically, a user builds an RDD DAG and Spark should produce a result that meets the user's expectation.

That's why we should fix `repartition`.
@cloud-fan There is no ambiguity in the output of map - one record in, one record out. Note that the problem I surfaced above will cause data loss even after the proposed fix in this PR by @jiangxb1987, btw.

yes, but the following
IIUC the output produced by
(Editing my previous response - not well thought out given it is late night :) )
@jiangxb1987 data loss comes because a re-execution of zip might generate a key for which the corresponding reducer has already finished.
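A minimal hypothetical illustration of this mechanism (using `zipWithIndex` to make the order-derived key explicit; `someShuffledRdd` and the paths are placeholders, not @mridulm's exact example):

```scala
// Keys here are derived from iteration order. If a lost partition is recomputed
// and yields its rows in a different order, the retried map task emits keys
// whose reducers may have already finished and committed, so the rows
// originally destined for those reducers are silently dropped.
val keyed = someShuffledRdd        // per-partition order may change on recompute
  .zipWithIndex()                  // (row, position) -- position is order-derived
  .map { case (row, idx) => (idx % 200, row) }

keyed.reduceByKey((a, _) => a).saveAsTextFile("hdfs://.../out")
```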
OK, we can treat it as data loss. However, it's not caused by Spark but by the user. If a user calls

Anyway, if we do want to "fix" the
Taking a step back and analyzing the solution for the problem at hand.

The corner case where the proposal works is if: This is a fairly narrow subset, which is why I don't believe the current approach helps.
If you are looking at recomputing, how are you going to handle the case where some tasks have already written output? This was brought up by @cloud-fan above and I didn't see a response. Some output formats have a task commit and then a job commit, so it may work for those, but others might not have that.

I really disagree with this. We need to fix this in some way before the 2.4 release. If the sort way is a fix, even with a performance regression, we should do that, as it's at least fixed by default. We have the config for people who are ok with possible corruption and just want the performance. I wouldn't think it's any worse than what is there for dataframes based on what you have said. Did anyone run benchmarks on the fix for dataframes? I'm really curious what the real performance implications are.

Note that Apache PIG also had a similar issue with the round-robin partitioner, and they removed it and used a hash value partitioner. Spark is obviously different but the underlying issue is the same. I would actually prefer to see us just use the hash partitioner if we can't find a better solution. I don't think our official docs say it repartitions evenly (http://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.rdd.RDD), but our programming guide does:

So I could see the argument that says we can't change that behavior.
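A sketch of what the hash-based alternative mentioned here might look like (not the merged fix; `distributePartition` is the helper from `RDD.coalesce` shown earlier, with `T` and `numPartitions` in scope, and the even-distribution trade-off noted above still applies):

```scala
// Sketch: derive the target partition from the record itself rather than from
// its position, so the assignment is insensitive to iteration order. Skewed or
// low-cardinality data will no longer spread evenly across partitions.
val distributePartition = (index: Int, items: Iterator[T]) =>
  items.map { t =>
    val h = if (t == null) 0 else t.hashCode
    (((h % numPartitions) + numPartitions) % numPartitions, t)
  }
```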
I really agree with Tom. At this point, I think the working assumption should be that any 2.4.0 release candidate that doesn't deliver some fix for this issue will receive multiple -1 votes from PMC members.
@jiangxb1987 can you clarify if you are still working on this or if you won't have time for a bit? @mridulm @zsxwing @cloud-fan thoughts on @squito's approach of "you can at least sort the serialized bytes of T"?
I think this should work.
@tgravescs I'm still working on this, but I would be glad if you can also work on the "sort the serialized bytes of T" approach. Actually the retry-all-tasks approach seems more complex than I expected once it involves the commit protocol (currently a task can only be committed once, so if you already have some tasks committed and then hit an ExecutorLost, retry-all-tasks won't work), so I hope we can have other approaches like "sort the serialized bytes of T" merged into the 2.4 release. I'll post the benchmark result of the DF.repartition() fix later.
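For reference, a rough user-level sketch of the "sort the serialized bytes of T" idea (all names are illustrative; this is not the implementation being discussed for the core):

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkEnv
import org.apache.spark.rdd.RDD

// Sketch: impose a total order on arbitrary, possibly non-Comparable records by
// comparing their serialized bytes, so each partition's iteration order is
// reproducible before the order-sensitive round-robin repartition.
def repartitionWithStableOrder[T: ClassTag](rdd: RDD[T], numPartitions: Int): RDD[T] = {
  rdd.mapPartitions { iter =>
    val ser = SparkEnv.get.serializer.newInstance()
    iter.map(t => (ser.serialize(t).array(), t))
      .toArray
      .sortWith { (x, y) =>
        // unsigned lexicographic comparison of the serialized forms
        val a = x._1; val b = y._1
        var i = 0
        var res = 0
        while (res == 0 && i < a.length && i < b.length) {
          res = (a(i) & 0xff) - (b(i) & 0xff)
          i += 1
        }
        if (res != 0) res < 0 else a.length < b.length
      }
      .iterator
      .map(_._2)
  }.repartition(numPartitions)
}
```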
I also think @tgravescs's solution of using the HashPartitioner is an acceptable one, though as you've noted it doesn't deal w/ skew (which may account for a lot of the existing use of `repartition`).
@tgravescs I vaguely remember someone at y! labs telling me (more than a decade back) about MR always doing a sort as part of its shuffle to avoid a variant of this problem by design.

[1] https://hadoop.apache.org/docs/r0.23.11/api/src-html/org/apache/hadoop/io/BytesWritable.html
@squito @tgravescs I am probably missing something about why the hash partitioner helps, can you please clarify?

Btw, when shuffle = false, it does not suffer from the problem - mentally I had assumed that had an issue too - on a recheck now, I find it interesting that it does not (I never used that, so had never checked in detail!)

Also, as I mentioned a few times above ... repartition/coalesce is only one of the public APIs in Spark which suffers from this; a host of others have the same issue as well.
I took a quick look at the shuffle writer and feel it will be hard to insert a sort there.

After more thought, the problem can be generalized as

Step 3 is problematic: assume we have 5 map tasks and 5 reduce tasks, and the input data is random. Let's say reduce tasks 1, 2, 3, 4 are finished, and reduce task 5 failed with a fetch failure.

That said, I think we are too optimistic when handling fetch failure. We should keep in mind that RDD output can be random. So when we see a fetch failure and rerun map tasks, we should track which reducers have their shuffle blocks rewritten, and rerun them. Simply inserting a sort before the shuffle doesn't help. The fix for dataframe is adding a sort before the round-robin, to make it deterministic. If we add the sort after the round-robin and before the shuffle, the problem still exists.

I think the correct fix is: be more conservative when handling fetch failure and rerun more reduce tasks. We can provide an internal API to tag an RDD as deterministic (very common in Spark SQL) and then we can safely be optimistic when handling fetch failure.
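A sketch of what such a tagging API might look like (illustrative names only; the eventual fix in #22112, whose commit message is quoted at the end of this thread, distinguishes determinate, unordered, and indeterminate outputs):

```scala
// Illustrative sketch of an internal "how deterministic is this RDD's output"
// tag, mirroring the three behaviors described later in this thread.
object DeterministicLevel extends Enumeration {
  val DETERMINATE,          // same data, same order on every recompute
      UNORDERED,            // same data, order may differ on recompute
      INDETERMINATE = Value // data itself may differ on recompute
}

// Hypothetical hook on RDD: the scheduler could stay optimistic on fetch
// failure (rerun only the missing map outputs) when a stage's output is
// DETERMINATE or UNORDERED, and roll back all succeeding stages when it is
// INDETERMINATE.
// def outputDeterministicLevel: DeterministicLevel.Value = DeterministicLevel.DETERMINATE
```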
Thanks @cloud-fan, your summary above is super useful, and I think it's clear enough.

IIUC, patterns like

Does this mean that, if we can generate non-deterministic output, then we can still lose some data even if we add a local sort before the shuffle, because the reduce tasks may have already finished (or even have committed)?

This is somewhat like what I proposed yesterday. One issue we can't resolve is that some ResultTasks may have committed; in that case it seems the best effort we can make is to just fail the job.
@mridulm sorry I wasn't real clear. I guess on the RDD side it's not called RoundRobinPartitioner (like it is on the dataframe side), but the distributePartition function is essentially doing that, and it would need to change to just use a normal hash or something that is deterministic. Basically any operation that does the shuffle has to have deterministic output for choosing which reducer each record goes to. The idea at least is what the first PR for this jira was: #20414 (disclaimer, I haven't looked at that in detail). I'll spend more time going through the code to see all the specifics. But like we've discussed, the downside is it will not be evenly distributed. So the question is whether we would want that?

Thanks @cloud-fan for the write-up. This is exactly why I brought up HashPartitioner (i.e. stop using round robin, whether in the partitioner or the distributePartition function) and exactly why Pig stopped using it for its Union operation. It's not just us doing it internally; the user code could do anything such that the output is not in the same order on rerun of the map task. Like you said, the sort has to be done before the round robin and it has to be done on the entire record (not just on the key, for instance, if you had key,value), and I do see this as being a possibly very expensive operation. But if, for instance, we can't change repartition to not be evenly distributed, it seems like one of the only options. I haven't looked at the details about inserting it here either, so I need to do that to understand how complicated it would be. If we can't come up with another solution, I would actually be ok with failing short term; it's better than corruption. Or perhaps we can actually allow the user to choose the behavior: one config for failing, one for when they don't care because they know they checkpointed or something, and one for doing the sort.

Like @jiangxb1987 said and we discussed above, I don't think this will work when you have result tasks that could have committed output. You can't undo a task commit. Now we could do a combination of things where, as long as we aren't a ResultTask, we fail all reducers and maps to rerun. If we are running a ResultTask, you either fail the entire job or fall back to the expensive sort. Just to kind of summarize, the only solutions that I've thought of or that were mentioned by others:
@cloud-fan I think we have to be clear on the boundaries of the solution we can provide in Spark.

As stated above, this is something we do not support in Spark. Which is why 'randomness' (or rather pseudo-randomness) is seeded using invariants like the partition id, which results in the same output partition on task re-execution. The problem we have here is: even if user code satisfies this constraint, due to non-determinism in the input order, the output changes when the closure is order sensitive. Given this, analyzing the statement below:

Here - map tasks 3 and 4 will always produce the same output partition for supported closures - if the input partition remains the same. With a deterministic input partition, we can see that the output of map tasks 3 and 4 will always be the same - and the reduce task inputs for 3/4/5 will be the same. So only reduce task 5 will need to be rerun and none of the other inputs will change.
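For reference, the supported pattern mentioned above (pseudo-randomness seeded from invariants like the partition index) looks like this in hypothetical user code:

```scala
import scala.util.Random

// Sampling seeded by the partition index: deterministic across task retries,
// provided the partition's input (content and order) does not change -- which
// is exactly the assumption that an order-sensitive upstream such as the
// round-robin repartition breaks.
val sampled = rdd.mapPartitionsWithIndex { (index, iter) =>
  val rng = new Random(index)                 // seed from an invariant
  iter.filter(_ => rng.nextDouble() < 0.1)    // ~10% sample, repeatable per attempt
}
```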
Thanks for clarifying @tgravescs! I was looking at

If I understand correctly, the proposal is

This looks like a reasonable stopgap until we fix the issue. It also allows users to make progress by inserting a checkpoint before the order-sensitive closure to unblock them.
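For example, the checkpoint-based unblocking mentioned here could look like this (hypothetical user code; paths are placeholders):

```scala
// Materialize the input before the order-sensitive step: after checkpointing,
// recomputation reads the persisted data back instead of re-running the
// (possibly order-changing) upstream lineage.
sc.setCheckpointDir("hdfs://.../checkpoints")
// upstream output order could change on recompute (e.g. after a shuffle)
val upstream = sc.textFile("hdfs://.../input").map(w => (w, 1L)).reduceByKey(_ + _)
upstream.checkpoint()
upstream.count()                        // action to force the checkpoint
val result = upstream.repartition(100)  // round-robin now reads stable, persisted input
```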
Ah, this is a reasonable restriction; we should document it in the RDD classdoc. How about the source (root RDD or shuffle)? The output of a reduce task is non-deterministic because Spark fetches multiple shuffle blocks at the same time and it's random which shuffle blocks finish fetching first. The external sorter has the same problem: the output order can change if spilling happens. Generally I think there are 3 directions:
So I think the assumption is that task results are idempotent but not ordered. Sorry if that sounds contradictory. The data itself has to be the same on rerun, but the order of things in there doesn't. That was my general assumption. I think zip doesn't follow that though, when the inputs aren't ordered. Not sure if there are others Spark supports; I need to go through the list I guess, unless someone already has? I think we just need to document these operations and say the results can be inconsistent if not sorted, or perhaps give them an option to also sort. Either that or we have to say we don't support unordered output at all in Spark. Thoughts on just documenting zip or others with unordered input?

I don't think MapReduce and Pig have this issue because they don't internally support an operation like zip; everything is on key/values and joins, groupBy on the keys. User code there could generate it as well, but I would claim it's the user's fault there.
I tried a prototype to fix the handling of fetch failure, seems not that hard: #22112
@jiangxb1987 I am guessing we should close this PR?
Thanks everyone! I closed this in favor of #22112
An alternative fix for apache#21698

When Spark reruns tasks for an RDD, there are 3 different behaviors:
1. determinate. Always returns the same result with the same order when rerun.
2. unordered. Returns the same data set in random order when rerun.
3. indeterminate. Returns different result when rerun.

Normally Spark doesn't need to care about it. Spark runs stages one by one; when a task fails, just rerun it. Although the rerun task may return a different result, users will not be surprised.

However, Spark may rerun a finished stage when seeing fetch failures. When this happens, Spark needs to rerun all the tasks of all the succeeding stages if the RDD output is indeterminate, because the input of the succeeding stages has been changed. If the RDD output is determinate, we only need to rerun the failed tasks of the succeeding stages, because the input doesn't change. If the RDD output is unordered, it's the same as determinate, because the shuffle partitioner is always deterministic (the round-robin partitioner is not a shuffle partitioner that extends `org.apache.spark.Partitioner`), so the reducers will still get the same input data set.

This PR fixed the failure handling for `repartition`, to avoid correctness issues. For `repartition`, it applies a stateful map function to generate a round-robin id, which is order sensitive and makes the RDD's output indeterminate. When the stage containing `repartition` reruns, we must also rerun all the tasks of all the succeeding stages.

**future improvement:**
1. Currently we can't roll back and rerun a shuffle map stage, and just fail. We should fix it later. https://issues.apache.org/jira/browse/SPARK-25341
2. Currently we can't roll back and rerun a result stage, and just fail. We should fix it later. https://issues.apache.org/jira/browse/SPARK-25342
3. We should provide a public API to allow users to tag the random level of the RDD's computing function.

a new test case

Closes apache#22112 from cloud-fan/repartition.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Xiao Li <[email protected]>
backport #22112 to 2.3

(The description is the same as the #22112 commit message above.)

Closes #22354 from cloud-fan/repartition.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ectness issue

## What changes were proposed in this pull request?

Back port of #22354 and #17955 to 2.2 (#22354 depends on methods introduced by #17955). The description is otherwise the same as the #22112 commit message above.

## How was this patch tested?

a new test case

Closes #22382 from bersprockets/SPARK-23243-2.2.

Lead-authored-by: Bruce Robbins <[email protected]>
Co-authored-by: Josh Rosen <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
RDD repartition uses a round-robin way to distribute data, thus there may be a data correctness issue if only a subset of partitions is recomputed on fetch failure and the input data sequence is not deterministic.
The RDD data type may not be sortable, so we cannot resolve the whole issue by inserting a local sort before the shuffle (though we should still provide that solution as an optional choice for RDDs with a sortable data type). The approach proposed in this PR is to always recompute all the partitions before the shuffle on fetch failure, so we don't rely on a particular input data sequence.
Please note that with this feature on, you may observe a higher risk of job failure due to reaching the max consecutive stage failure limit, especially for large jobs running on a big cluster.
How was this patch tested?
TBD