Conversation

jiangxb1987 (Contributor) commented Jul 31, 2018

What changes were proposed in this pull request?

Check on job submit to make sure we don't launch a barrier stage with an unsupported RDD chain pattern. The following patterns are not supported:

  • Ancestor RDDs that have a different number of partitions from the resulting RDD (e.g. union()/coalesce()/first()/PartitionPruningRDD);
  • An RDD that depends on multiple barrier RDDs (e.g. barrierRdd1.zip(barrierRdd2)).

How was this patch tested?

Add test cases in BarrierStageOnSubmittedSuite.
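A representative test has this shape (a sketch only; it reuses the testSubmitJob helper and the error message shown in the review below, and the pruning predicate is illustrative):

test("submit a barrier stage with PartitionPruningRDD") {
  val rdd = sc.parallelize(1 to 10, 4)
    .barrier()
    .mapPartitions((iter, context) => iter)
  // Prune away some partitions; submitting this barrier stage must fail fast.
  val prunedRdd = new PartitionPruningRDD(rdd, index => index > 1)
  testSubmitJob(sc, prunedRdd,
    "Don't support run a barrier stage that contains PartitionPruningRDD")
}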

@holdensmagicalunicorn
@jiangxb1987, thanks! I am a bot who has found some folks who might be able to help with the review: @squito, @mateiz, and @rxin

SparkQA commented Jul 31, 2018

Test build #93820 has finished for PR 21927 at commit 0733bfb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

jiangxb1987 (Contributor Author)

retest this please

SparkQA commented Jul 31, 2018

Test build #93827 has finished for PR 21927 at commit 0733bfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr (Contributor) left a comment

Made one pass.

}

/**
* Traverse all the parent RDDs within the same stage with the given RDD, check whether all the
Contributor:

It also checks the RDD itself. "Traverses the given RDD and its ancestors within the same stage and checks whether all of the RDDs satisfy a given predicate."

)

val error = intercept[SparkException] {
ThreadUtils.awaitResult(futureAction, 1 seconds)
Contributor:

I would make the timeout slightly larger, like 5 seconds, to buffer against unexpected pauses or slowdowns.

val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("test")
sc = new SparkContext(conf)
Contributor:

LocalSparkContext already provides an active context.

}
visited += toVisit
toVisit.dependencies.foreach {
case shuffleDep: ShuffleDependency[_, _, _] =>
Contributor:

minor: shuffleDep is not used. You can use _.
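A minimal sketch of the suggested form (the second case is a hypothetical placeholder for the rest of the match):

toVisit.dependencies.foreach {
  // The bound name was unused, so a typed wildcard pattern is enough.
  case _: ShuffleDependency[_, _, _] =>
    // Stop here: parents beyond a shuffle belong to a different stage.
  case dependency =>
    // ... handle narrow dependencies (elided)
}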

class BarrierStageOnSubmittedSuite extends SparkFunSuite with LocalSparkContext {

private def testSubmitJob(sc: SparkContext, rdd: RDD[Int], message: String): Unit = {
val futureAction = sc.submitJob(
Contributor:

Is it the same as the following?

private def testSubmitJob(rdd: RDD[Int], message: String): Unit = {
  val err = intercept[SparkException] {
    rdd.count()
  }.getCause.getMessage
  assert(err.contains(message))
}

Okay to keep the current version if we do want to ensure this is from submitJob().

private def checkBarrierStageWithPartitionPruningRDD(rdd: RDD[_]): Unit = {
if (rdd.isBarrier() &&
!traverseParentRDDsWithinStage(rdd, (r => !r.isInstanceOf[PartitionPruningRDD[_]]))) {
throw new SparkException("Don't support run a barrier stage that contains " +
Contributor:

  • Since the error message is used in the test, it would be nice to make it a package-private constant and use it in the test (see the sketch after this list).
  • "Barrier execution mode does not support partition pruning (PartitionPruningRDD)." should be sufficient.

.barrier()
.mapPartitions((iter, context) => iter)
testSubmitJob(sc, rdd,
"Don't support run a barrier stage that contains PartitionPruningRDD")
Contributor:

Ditto. Define the message as a constant.

mengxr (Contributor) commented Jul 31, 2018

@jiangxb1987 Second thought: PartitionPruningRDD is just an implementation of RDD. Every user / developer can implement a similar one. Also this doesn't handle the case mentioned by @felixcheung : a.union(b).barrier(). So I'm thinking about checking number of partitions instead of instances of PartitionPruningRDD in this PR. Basically, we check the input RDD and all its parents have the same number of partitions. If not, we throw an error message like "Barrier execution mode doesn't support partition union / pruning.". Thoughts?
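A minimal sketch of that idea (assuming traverseParentRDDsWithinStage takes an RDD[_] => Boolean predicate, as in the current diff; the merged version appears later in this thread):

private def checkBarrierStageWithRDDChainPattern(rdd: RDD[_], numPartitions: Int): Unit = {
  // Fail fast if the barrier RDD or any ancestor within the stage reports
  // a partition count different from the number of tasks to be launched.
  if (rdd.isBarrier() &&
      !traverseParentRDDsWithinStage(rdd, _.getNumPartitions == numPartitions)) {
    throw new SparkException(
      "Barrier execution mode doesn't support partition union / pruning.")
  }
}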

Btw, we should provide more info to users in the error message. For example, user might use "first()" without understanding "partition pruning".

cc: @gatorsmile

squito (Contributor) commented Jul 31, 2018

> Second thought: PartitionPruningRDD is just an implementation of RDD. Every user / developer can implement a similar one. Also this doesn't handle the case mentioned by @felixcheung : a.union(b).barrier(). So I'm thinking about checking number of partitions instead of instances of PartitionPruningRDD in this PR. Basically, we check the input RDD and all its parents have the same number of partitions. If not, we throw an error message like "Barrier execution mode doesn't support partition union / pruning.". Thoughts?

Yeah, that's a good point, but what about coalesce()? That should actually work, shouldn't it? Maybe you'd add an exception for CoalescedRDD, or add another property like processAllInputPartitions or something ...

jiangxb1987 changed the title from "[SPARK-24820][Core] Fail fast when submitted job contains PartitionPruningRDD in a barrier stage" to "[SPARK-24820][SPARK-24821][Core] Fail fast when submitted job contains a barrier stage with unsupported RDD chain pattern" on Aug 1, 2018
* Check to make sure we don't launch a barrier stage with unsupported RDD chain pattern. The
* following patterns are not supported:
* 1. Ancestor RDDs that have different number of partitions from the resulting RDD (eg.
* union()/coalesce()/first()/PartitionPruningRDD);
Contributor:

but coalesce should be OK, right? Is it just too fragile to allow coalesce while excluding the others?

Contributor:

coalesce() is not safe when shuffle is false, because it may cause the number of tasks to differ from the number of partitions of the RDD that runs in barrier mode.
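For example (a sketch, reusing the barrier API shown in this PR):

val rdd = sc.parallelize(1 to 10, 4)
  .barrier()
  .mapPartitions((iter, context) => iter)  // barrier RDD with 4 partitions
  .coalesce(2)  // shuffle = false by default, so no new stage boundary
// The stage launches only 2 tasks, while the barrier contract expects all
// 4 partitions of the barrier RDD to be processed by tasks launched together.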

Contributor:

OK, I see that it'll be a different number of partitions, but conceptually it should be OK, right? The user just wants all tasks launched together, even if it's a different number of tasks than the number of partitions in the original barrier RDD.

Contributor:

But anyway, I guess it's also fine to not support this case; I was just trying to understand it myself.

SparkQA commented Aug 1, 2018

Test build #93880 has finished for PR 21927 at commit 48f1ef4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr (Contributor) left a comment

LGTM except some minor inline comments.

* union()/coalesce()/first()/PartitionPruningRDD);
* 2. An RDD that depends on multiple barrier RDDs (eg. barrierRdd1.zip(barrierRdd2)).
*/
private def checkBarrierStageWithRDDChainPattern(rdd: RDD[_], numPartitions: Int): Unit = {
Contributor:

It would be nice to rename numPartitions to numTasksInStage (or a better name).

val ERROR_MESSAGE_RUN_BARRIER_WITH_UNSUPPORTED_RDD_CHAIN_PATTERN =
"[SPARK-24820][SPARK-24821]: Barrier execution mode does not allow the following pattern of " +
"RDD chain within a barrier stage:\n1. Ancestor RDDs that have different number of " +
"partitions from the resulting RDD (eg. union()/coalesce()/first()/PartitionPruningRDD);\n" +
Contributor:

Please also list take(). It would be nice to provide a workaround for first() and take(): barrierRdd.collect().head (Scala), barrierRdd.collect()[0] (Python).

Member:

collect() is expensive though?

SparkQA commented Aug 2, 2018

Test build #93963 has finished for PR 21927 at commit bb819f3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

jiangxb1987 (Contributor Author)

retest this please

SparkQA commented Aug 2, 2018

Test build #93970 has finished for PR 21927 at commit bb819f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr (Contributor) commented Aug 2, 2018

LGTM. Merged into master. Thanks!

asfgit closed this in 38e4699 on Aug 2, 2018