[SPARK-20219] Schedule tasks based on size of input from ShuffledRDD #17533
jinxing64 wants to merge 9 commits into apache:master
Conversation
Test build #75529 has started for PR 17533 at commit

Test build #75531 has started for PR 17533 at commit

Test build #75532 has started for PR 17533 at commit
Tasks are scheduled by locality (which includes shuffle tasks too, to some extent).

Yes, I did the test on my cluster. In a highly skewed stage, the time cost can be reduced significantly. Tasks are scheduled with locality preference, but in the current code the input size of tasks is not taken into consideration. Think about this scenario:
In the current code, the tasks are scheduled in serial order, so the task for partition-8 will be the last one to launch and the time cost is 12. This change is related to SPARK-19100. In my prod env, skew happens mostly on ShuffledRDD, so this PR proposes to consider the size of input from ShuffledRDD when scheduling. This change can bring a benefit in skew situations and won't have a negative impact on performance in other scenarios.
  // Set containing all pending tasks (also used as a stack, as above).
- private val allPendingTasks = new ArrayBuffer[Int]
+ private var allPendingTasks = new ArrayBuffer[Int]

I made this a var because I don't have a better approach for sorting an ArrayBuffer in place. Advice?
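One way to avoid the var, as a minimal sketch (the helper name is hypothetical, not from the PR): rebuild the buffer's contents instead of reassigning the reference, so the field can stay a val.

import scala.collection.mutable.ArrayBuffer

// Sort an ArrayBuffer "in place" by producing a sorted copy and refilling the buffer.
def sortBufferInPlace(buf: ArrayBuffer[Int])(implicit ord: Ordering[Int]): Unit = {
  val sorted = buf.sorted  // returns a new, sorted ArrayBuffer
  buf.clear()
  buf ++= sorted
}

val allPendingTasks = new ArrayBuffer[Int]
allPendingTasks ++= Seq(2, 0, 1)
sortBufferInPlace(allPendingTasks)  // allPendingTasks is now ArrayBuffer(0, 1, 2)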
srowen
left a comment
I'm always wary of touching task scheduling and don't feel that comfortable approving it, but the idea is plausible.
  t.epoch = epoch
}

val sortedPendingTasks = new AtomicBoolean(false)
  blacklist.isExecutorBlacklistedForTaskSet(execId)
}
if (!isZombie && !offerBlacklisted) {
  if (!sortedPendingTasks.get()) {

I think you need if (sortedPendingTasks.compareAndSet(false, true)), or else the point of the AtomicBoolean is kind of lost.
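A minimal sketch of the suggested pattern (names are illustrative): compareAndSet lets exactly one caller win the race and perform the one-time sort, whereas a plain get()/set() pair leaves a window where two threads could both pass the check.

import java.util.concurrent.atomic.AtomicBoolean

object SortOnceSketch {
  val sortedPendingTasks = new AtomicBoolean(false)

  def sortOnce(sort: () => Unit): Unit = {
    if (sortedPendingTasks.compareAndSet(false, true)) {
      sort()  // runs at most once across all threads
    }
  }
}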
private[this] def sortPendingTasks(): Unit = {
  val taskIndexs = (0 until numTasks).toArray
  implicit def ord = new Ordering[Int] {

Maybe it would be clearer to use sortWith below and pass the ordering explicitly?
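A minimal sketch of that alternative, assuming a hypothetical inputSizeOf(i) that returns the shuffle input bytes for task index i:

def sortedTaskIndexes(numTasks: Int, inputSizeOf: Int => Long): Array[Int] =
  // Largest input first, so the most expensive tasks are launched earliest.
  (0 until numTasks).toArray.sortWith((x, y) => inputSizeOf(x) > inputSizeOf(y))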
// Visible for testing
private[spark] def setTaskInputSizeFromShuffledRDD(inputSize: Map[Task[_], Long]) = {
  taskInputSizeFromShuffledRDD.clear()
  inputSize.foreach {

I might be missing something, but isn't this just adding all the entries from one Map to another? Does ++ do this directly?
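A minimal sketch of the reviewer's point, with hypothetical String keys standing in for Task[_]: copying one map into a mutable map is a single ++= call.

import scala.collection.mutable

object CopyMapSketch {
  val taskInputSize = mutable.Map.empty[String, Long]

  def setTaskInputSize(inputSize: Map[String, Long]): Unit = {
    taskInputSize.clear()
    taskInputSize ++= inputSize  // same effect as inputSize.foreach { case (k, v) => taskInputSize(k) = v }
  }
}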
val taskIndexs = (0 until numTasks).toArray
implicit def ord = new Ordering[Int] {
  override def compare(x: Int, y: Int): Int =
    getTaskInputSizeFromShuffledRDD(tasks(x)) compare

Go ahead and use x.compare(y) rather than omitting the explicit syntax.
case Some(size) => size
case None =>
  val size =
    sched.dagScheduler.parentSplitsInShuffledRDD(task.stageId, task.partitionId) match {

This might still be clearer as .getOrElse(..... , 0L)

Yes, this should be made clearer. But sorry, I couldn't find a function like getOrElse(func, 0L).

Oh, what I really mean is ...map(parentSplits => ...).getOrElse(0L)
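A minimal sketch of the two equivalent forms being discussed, with a hypothetical Option[Seq[Long]] standing in for the result of parentSplitsInShuffledRDD:

def sizeWithMatch(parentSplits: Option[Seq[Long]]): Long = parentSplits match {
  case Some(splits) => splits.sum
  case None => 0L
}

def sizeWithGetOrElse(parentSplits: Option[Seq[Long]]): Long =
  parentSplits.map(splits => splits.sum).getOrElse(0L)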
Test build #75538 has finished for PR 17533 at commit

Test build #75542 has finished for PR 17533 at commit

Test build #75543 has finished for PR 17533 at commit
I'm hesitant about this and posted some comments on the JIRA (we should try to keep high-level discussion about whether this change makes sense there, so it's easier to reference in the future and not tangled up in the low-level PR comments).

Test build #75547 has finished for PR 17533 at commit
@kayousterhout |
squito
left a comment
@kayousterhout @mridulm what do you think about the refactor I suggested? Maybe that wouldn't really increase the complexity significantly?
@jinxing64 if you're really motivated, you could try it out and see how things look, though no promises yet ...
case Some(size) => size
case None =>
  val size =
    sched.dagScheduler.parentSplitsInShuffledRDD(task.stageId, task.partitionId).map {

I think the major complaint is this call: we don't want the TSM requesting info from the DAGScheduler. But you could change that -- instead, the DAGScheduler could push this info into the TSM after it has the input sizes. That actually might not be that bad, since the DAGScheduler has to know this info anyway when it calls submitMissingTasks. I don't think this info should change at all after the task set has been submitted, right? So you'd have it all available at construction time.
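A hypothetical sketch of that "push, don't pull" shape (all names here are illustrative, not the PR's actual classes): the DAGScheduler computes per-partition input sizes before submitting the task set and hands them over at construction time, so the TaskSetManager never calls back into the DAGScheduler.

class TaskSetManagerSketch(
    partitionIds: Seq[Int],
    inputBytesByPartition: Map[Int, Long]) {  // fixed at construction time

  // Largest input first, computed once up front.
  val scheduleOrder: Seq[Int] =
    partitionIds.sortBy(p => -inputBytesByPartition.getOrElse(p, 0L))
}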
assert(manager.resourceOffer("exec", "host", ANY).get.index === 3)
assert(manager.resourceOffer("exec", "host", ANY).get.index === 1)
assert(manager.resourceOffer("exec", "host", ANY).get.index === 0)
}

We'd also want a test to make sure the sizes are getting computed correctly. (I think that might be easier to do with the refactor I suggested?)
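A hypothetical sketch of that extra check (the sizes here are illustrative, not from the PR): given known per-partition input sizes, the computed schedule order should be largest-input-first.

val sizes = Map(0 -> 10L, 1 -> 40L, 2 -> 5L, 3 -> 100L)  // partitionId -> input bytes
val order = sizes.keys.toSeq.sortBy(p => -sizes(p))      // largest first
assert(order == Seq(3, 1, 0, 2))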
@squito
Sorry, I missed this point in the previous change. Now I push the info (size of input from ShuffledRDD) when creating the TSM.

Test build #75634 has finished for PR 17533 at commit

Test build #75695 has started for PR 17533 at commit

Test build #75697 has finished for PR 17533 at commit
squito
left a comment
I think this is a lot more complicated than it needs to be. You should be able to simplify significantly by looking at what the code does for the "map-stage jobs" and how those MapStatistics are used later -- I left a couple of inline comments hinting at that, though I didn't figure out all the details.
fwiw, I'm no longer opposed to this for complicating the relationship between the DAGScheduler & TSM; this version maintains the current separation. Still, I do think that in its current form this is introducing too much complexity. If it can be simplified a lot, then I might be more OK with it.
}

// Visible for testing.
private[spark] def getTaskInputSizesFromShuffledRDD(tasks: Seq[Task[_]]): Map[Task[_], Long] = {

It doesn't look like this needs to be exposed at all for tests. (And if it were used in tests, it could probably be a little tighter as private[scheduler].)
val noPartitionerConflict = rdd.partitioner match {
  case Some(partitioner) =>
    partitioner.isInstanceOf[HashPartitioner] &&
      dep.partitioner.isInstanceOf[HashPartitioner] &&

(a) I don't really understand what is going on here. Why would rdd.partitioner ever be different from one of the shuffle dependencies' partitioners? I thought shuffle dependencies always have to have the same partitioner? (If there is a good reason, this probably needs a comment in the code.)
(b) If this is needed, it can probably just be partitioner == dep.partitioner -- that is simpler, equivalent for HashPartitioner, and allows it to still work for other partitioners as well.

Yes, I always thought rdd.partitioner should be the same as the shuffle dependency's partitioner, but I found that CustomShuffledRDD is a different one.
/**
 * Get ancestor splits in ShuffledRDD.
 */

This is called both "parents" and "ancestors", which is confusing. I think it would be most accurate to call it the stage parent, and note that finding it requires traversing some distance up the set of RDD ancestors.
Also, I commented on this a bit below ... in general this is pretty confusing. It seems like there are a lot of cases which get ignored, and I am not certain that's always OK.
Couldn't this entire thing be replaced with

val deps = getShuffleDependencies(rdd)
val partitioner = deps.head.partitioner
// make sure the partitioner is consistent across all our shuffle dependencies.
assert(deps.forall(_.partitioner == partitioner))
val allStats = deps.map(d => mapOutputTracker.getStatistics(d.shuffleId))
// TODO sum the stats per partition and go from there

Yes, this is confusing and I need to refine it.
I'm a little hesitant to use getShuffleDependencies. I need the total size of input from the ShuffledRDD for every child partition, and after transformations like CoalescedRDD there may not be a consistent one-to-one mapping between the ancestor's partition index and the child's partition index.
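For the TODO in the sketch above, summing the per-partition map-output bytes across all shuffle dependencies could look roughly like this hypothetical helper (it assumes each dependency's statistics expose an Array[Long] of bytes per reduce partition, as MapOutputStatistics.bytesByPartitionId does):

def totalBytesPerPartition(statsPerDep: Seq[Array[Long]]): Array[Long] = {
  require(statsPerDep.nonEmpty && statsPerDep.map(_.length).distinct.size == 1)
  val totals = new Array[Long](statsPerDep.head.length)
  for (stats <- statsPerDep; p <- stats.indices) {
    totals(p) += stats(p)  // add this dependency's bytes for reduce partition p
  }
  totals
}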
parentSplits.map {
  case (shuffleId, splits) =>
    splits.map(mapOutputTracker.getMapSizesByExecutorId(shuffleId, _)
      .flatMap(_._2.map(_._2)).sum).sum

I think you can use mapOutputTracker.getStatistics here.
It also occurs to me that in general this could use the total input size for the task, but I guess Spark isn't looking at that in general yet (though it probably could, from Hadoop's InputSplit.getLength()). Just something to keep in mind.

"It also occurs to me that in general this could use the total input size for the task, but I guess Spark isn't looking at that in general yet (though it probably could, from Hadoop's InputSplit.getLength()). Just something to keep in mind."

Agree :)
  t.epoch = epoch
}

private val sortedPendingTasks = new AtomicBoolean(false)

I think all of this stuff with the delayed sorting etc. is now totally unnecessary. If you've got the input sizes, just sort the tasks when the TSM is created, which avoids a lot of complexity.
Perhaps the tasks should even just get sorted when the TaskSet is created in the first place; then this code doesn't know or care that the tasks have been sorted in any particular way.

Yes, in the current change I do the ordering when creating the TaskSet, so there is no change in the TSM now. Thanks a lot for the suggestion :)
@squito

Test build #75800 has finished for PR 17533 at commit

Test build #75802 has finished for PR 17533 at commit

Hi @jinxing64, how is it going?

@HyukjinKwon |
What changes were proposed in this pull request?
When data is highly skewed on a ShuffledRDD, it makes sense to launch the tasks that process much more input as soon as possible. The current scheduling mechanism in TaskSetManager is quite simple: in a scenario where the "large tasks" sit in the bottom half of the tasks array, launching the tasks with much more input early can significantly reduce the time cost and save resources when "dynamic allocation" is disabled.
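A minimal sketch of the idea (not the PR's actual code; inputBytes is a hypothetical per-task size lookup): order the tasks so that those with the largest shuffle input are offered first.

def orderTasksBySize[T](tasks: Seq[T], inputBytes: T => Long): Seq[T] =
  tasks.sortBy(t => -inputBytes(t))  // largest input first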
How was this patch tested?
Added a unit test in TaskSetManagerSuite.