[SPARK-13021][CORE] Fail fast when custom RDDs violate RDD.partition's API contract #10932
Closed
JoshRosen wants to merge 2 commits into apache:master from
Conversation
[SPARK-13021][CORE] Fail fast when custom RDDs violate RDD.partition's API contract

Spark's `Partition` and `RDD.partitions` APIs have a contract which requires custom implementations of `RDD.partitions` to ensure that for all `x`, `rdd.partitions(x).index == x`; in other words, the `index` reported by a partition needs to match its position in the partitions array. If a custom RDD implementation violates this contract, then Spark can become stuck in an infinite recomputation loop when recomputing a subset of an RDD's partitions, since the tasks that are actually run will not correspond to the missing output partitions that triggered the recomputation. Here's a link to a notebook which demonstrates this problem: https://rawgit.com/JoshRosen/e520fb9a64c1c97ec985/raw/5e8a5aa8d2a18910a1607f0aa4190104adda3424/Violating%2520RDD.partitions%2520contract.html

In order to guard against this infinite-loop behavior, I think that Spark should fail fast and refuse to compute RDDs whose `partitions` violate the API contract.
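To make the contract concrete, here is a minimal Scala sketch of a custom RDD that violates it, along with an up-front assertion in the spirit of the check this patch proposes. The class and object names are hypothetical and the validation is an illustration, not the exact code being merged:

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical Partition whose reported index is set by the caller.
class SwappedPartition(override val index: Int) extends Partition

// Hypothetical custom RDD that violates the contract: the Partition at
// array position 0 reports index 1 and vice versa, so
// rdd.partitions(x).index != x.
class BrokenPartitionsRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
  override def getPartitions: Array[Partition] =
    Array(new SwappedPartition(1), new SwappedPartition(0))

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}

object ContractCheckExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("contract-check").setMaster("local[2]"))
    val rdd = new BrokenPartitionsRDD(sc)
    // Fail fast: verify the contract once, when the partitions array is
    // first materialized, instead of risking an infinite recomputation
    // loop later.
    rdd.partitions.zipWithIndex.foreach { case (partition, index) =>
      require(partition.index == index,
        s"partitions($index).index == ${partition.index}, " +
          "violating the RDD.partitions API contract")
    }
    sc.stop()
  }
}
```

Run against the broken RDD above, the `require` throws an `IllegalArgumentException` immediately, which is far easier to debug than a job that silently recomputes the wrong partitions forever.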
Member
I'm +1 on this in 2.0 :)
Contributor
Author
An open question is whether we want to put this in 1.6.1; this risks breaking user code which happened to work accidentally, but it also helps to guard against infinite-loop behavior.
Contributor
I don't think we should put it in 1.6.x.
Test build #50137 has finished for PR 10932 at commit
Contributor
Author
Jenkins, retest this please.
Test build #50157 has finished for PR 10932 at commit
Contributor
LGTM
Contributor
Merging to master.