[SPARK-24820][SPARK-24821][Core] Fail fast when submitted job contains a barrier stage with unsupported RDD chain pattern #21927
Conversation
@jiangxb1987, thanks! I am a bot who has found some folks who might be able to help with the review: @squito, @mateiz and @rxin

Test build #93820 has finished for PR 21927 at commit

retest this please

Test build #93827 has finished for PR 21927 at commit
mengxr left a comment:
Made one pass.
```scala
}

/**
 * Traverse all the parent RDDs within the same stage with the given RDD, check whether all the
```
It also checks the RDD itself. "Traverses the given RDD and its ancestors within the same stage and checks whether all of the RDDs satisfy a given predicate."
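mengxr's rewording describes a bounded DAG walk. A minimal standalone sketch of that traversal, using a toy `Node` class in place of Spark's `RDD`/`Dependency` types (every name here is an illustrative assumption, not Spark's actual API):

```scala
// Toy stand-in for an RDD lineage graph; `crossesShuffle` marks a parent
// reached through a shuffle dependency, i.e. a stage boundary.
class Node(val crossesShuffle: Boolean, val parents: Seq[Node])

// Returns true iff the node itself and every ancestor reachable without
// crossing a shuffle boundary satisfy the predicate.
def traverseWithinStage(root: Node, predicate: Node => Boolean): Boolean = {
  val visited = scala.collection.mutable.Set.empty[Node]
  def visit(n: Node): Boolean = {
    if (!visited.add(n)) true // already checked via another path
    else predicate(n) && n.parents.forall { p =>
      p.crossesShuffle || visit(p) // stop at the stage boundary
    }
  }
  visit(root)
}
```

The predicate sees the root first, which is exactly the behavior the suggested doc comment calls out ("It also checks the RDD itself").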
```scala
)

val error = intercept[SparkException] {
  ThreadUtils.awaitResult(futureAction, 1 seconds)
```
I would make the timeout slightly larger like 5 seconds to buffer unexpected pause/slow down.
```scala
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("test")
sc = new SparkContext(conf)
```
LocalSparkContext already provides an active context.
```scala
}
visited += toVisit
toVisit.dependencies.foreach {
  case shuffleDep: ShuffleDependency[_, _, _] =>
```
minor: shuffleDep is not used. You can use _.
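The point about the unused binding: in Scala, a typed pattern that never references its binder can use `_` instead of a name. A small self-contained illustration (the `Dep` hierarchy is made up for this example, not Spark's):

```scala
sealed trait Dep
final case class ShuffleDep(id: Int) extends Dep
final case class NarrowDep(id: Int) extends Dep

// `case _: ShuffleDep` instead of `case shuffleDep: ShuffleDep`
// when the matched value itself is never used in the case body.
def countShuffleDeps(deps: Seq[Dep]): Int = deps.count {
  case _: ShuffleDep => true
  case _             => false
}
```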
```scala
class BarrierStageOnSubmittedSuite extends SparkFunSuite with LocalSparkContext {

  private def testSubmitJob(sc: SparkContext, rdd: RDD[Int], message: String): Unit = {
    val futureAction = sc.submitJob(
```
Is it the same as the following?

```scala
private def testSubmitJob(rdd: RDD[Int], message: String): Unit = {
  val err = intercept[SparkException] {
    rdd.count()
  }.getCause.getMessage
  assert(err.contains(message))
}
```

Okay to keep the current version if we do want to ensure this is from submitJob().
```scala
private def checkBarrierStageWithPartitionPruningRDD(rdd: RDD[_]): Unit = {
  if (rdd.isBarrier() &&
      !traverseParentRDDsWithinStage(rdd, (r => !r.isInstanceOf[PartitionPruningRDD[_]]))) {
    throw new SparkException("Don't support run a barrier stage that contains " +
```
- Since the error message is used in the test, it would be nice to make it a package private constant and use it in the test.
- "Barrier execution mode does not support partition pruning (PartitionPruningRDD)." should be sufficient.
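The suggested refactor, a shared message constant used by both the check and the test, might look like this sketch (the object name and public visibility are illustrative assumptions; real Spark code would use `private[spark]` and `SparkException`):

```scala
object BarrierStageErrors {
  // In Spark this would be package-private: private[spark] val ...
  val UNSUPPORTED_PARTITION_PRUNING: String =
    "Barrier execution mode does not support partition pruning (PartitionPruningRDD)."
}

// Production code and the test suite both reference the same constant,
// so the assertion cannot drift out of sync with the thrown message.
def checkBarrierStage(isBarrier: Boolean, hasPruning: Boolean): Unit =
  if (isBarrier && hasPruning) {
    throw new IllegalArgumentException(BarrierStageErrors.UNSUPPORTED_PARTITION_PRUNING)
  }
```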
```scala
  .barrier()
  .mapPartitions((iter, context) => iter)
testSubmitJob(sc, rdd,
  "Don't support run a barrier stage that contains PartitionPruningRDD")
```
Ditto. Define the message as a constant.
@jiangxb1987 Second thought: Btw, we should provide more info to users in the error message. For example, a user might use first() without understanding "partition pruning". cc: @gatorsmile

yeah, that's a good point, but what about
```scala
/**
 * Check to make sure we don't launch a barrier stage with unsupported RDD chain pattern. The
 * following patterns are not supported:
 * 1. Ancestor RDDs that have different number of partitions from the resulting RDD (eg.
 *    union()/coalesce()/first()/PartitionPruningRDD);
```
but coalesce should be OK, right? Is it just too fragile to allow coalesce while excluding the others?
coalesce() is not safe when shuffle is false because it may cause the number of tasks not to match the number of partitions of the RDD that uses barrier mode.
OK, I see that it'll be a different number of partitions, but conceptually it should be OK, right? The user just wants all tasks launched together, even if it's a different number of tasks than the number of partitions in the original barrier RDD.
but anyway, I guess it's also fine to not support this case, I was just trying to understand it myself.
Test build #93880 has finished for PR 21927 at commit
mengxr left a comment:
LGTM except some minor inline comments.
```scala
 * union()/coalesce()/first()/PartitionPruningRDD);
 * 2. An RDD that depends on multiple barrier RDDs (eg. barrierRdd1.zip(barrierRdd2)).
 */
private def checkBarrierStageWithRDDChainPattern(rdd: RDD[_], numPartitions: Int): Unit = {
```
It would be nice to rename numPartitions to numTasksInStage (or a better name).
```scala
val ERROR_MESSAGE_RUN_BARRIER_WITH_UNSUPPORTED_RDD_CHAIN_PATTERN =
  "[SPARK-24820][SPARK-24821]: Barrier execution mode does not allow the following pattern of " +
  "RDD chain within a barrier stage:\n1. Ancestor RDDs that have different number of " +
  "partitions from the resulting RDD (eg. union()/coalesce()/first()/PartitionPruningRDD);\n" +
```
Please also list take(). It would be nice to provide a workaround for first() and take(): barrierRdd.collect().head (scala), barrierRdd.collect()[0] (python)
collect() is expensive though?
Test build #93963 has finished for PR 21927 at commit

retest this please

Test build #93970 has finished for PR 21927 at commit

LGTM. Merged into master. Thanks!
What changes were proposed in this pull request?
Check on job submit to make sure we don't launch a barrier stage with unsupported RDD chain pattern. The following patterns are not supported:
1. Ancestor RDDs that have different number of partitions from the resulting RDD (eg. union()/coalesce()/first()/PartitionPruningRDD);
2. An RDD that depends on multiple barrier RDDs (eg. barrierRdd1.zip(barrierRdd2)).
How was this patch tested?
Add test cases in BarrierStageOnSubmittedSuite.