[SPARK-20219] Schedule tasks based on size of input from ScheduledRDD #17533

Closed
jinxing64 wants to merge 9 commits into apache:master from jinxing64:SPARK-20219

Conversation

@jinxing64

What changes were proposed in this pull request?

When data is highly skewed in a ShuffledRDD, it makes sense to launch the tasks that process much more input as early as possible. The current scheduling mechanism in TaskSetManager is quite simple:

for (i <- (0 until numTasks).reverse) {
    addPendingTask(i)
}

In the scenario where the "large tasks" sit in the bottom half of the tasks array, launching the tasks with much more input early can significantly reduce the time cost and save resources when "dynamic allocation" is disabled, as sketched below.
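A minimal sketch of the intended ordering, assuming a per-task shuffle input size is already available (taskInputSizes is a hypothetical map from task index to input bytes; the actual diff is structured differently):

// allPendingTasks is consumed like a stack (tasks are popped from the end),
// so adding indices in ascending order of input size leaves the largest-input
// task on top, and it is therefore launched first.
for (i <- (0 until numTasks).sortBy(i => taskInputSizes.getOrElse(i, 0L))) {
    addPendingTask(i)
}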

How was this patch tested?

Added a unit test in TaskSetManagerSuite.

@SparkQA

SparkQA commented Apr 5, 2017

Test build #75529 has started for PR 17533 at commit f757e41.

@jinxing64 jinxing64 changed the title [SPARK-20219] Schedule tasks based on size of input from ScheduledRDD [WIP][SPARK-20219] Schedule tasks based on size of input from ScheduledRDD Apr 5, 2017
@SparkQA

SparkQA commented Apr 5, 2017

Test build #75531 has started for PR 17533 at commit 6d18a09.

@SparkQA

SparkQA commented Apr 5, 2017

Test build #75532 has started for PR 17533 at commit 878d676.

@mridulm
Contributor

mridulm commented Apr 5, 2017

Tasks are scheduled by locality (which includes shuffle tasks too, to some extent).
This makes a lot of state mutable within the TSM. Are there any tests that show improvements due to this change? Or is it an expectation that it will improve?

@jinxing64
Author

Yes, I did the test in my cluster. In a highly skewed stage, the time cost can be reduced significantly. Tasks are scheduled with locality preference, but in the current code the input size of tasks is not taken into consideration. Think about this scenario:

  1. There are 9 partitions (0~8) in the ShuffledRDD, and the size of partition-8 is 8 times that of each of the previous 8 partitions. (Let's assume the time cost of a task is linear in the size of its input, so the time cost of each of the first 8 tasks is 1 and the time cost of the last task is 8.)
  2. Tasks are scheduled on 2 executors.

In the current code the tasks are scheduled in serial order, so the task for partition-8 is the last one to launch and the total time cost is 12.
With this change, the task for partition-8 is scheduled first and the time cost is reduced to 8, as the small calculation below illustrates.

This change is related to SPARK-19100. In my prod env, skew happens mostly on ShuffledRDD, so this PR proposes to take the size of input from the ShuffledRDD into account when scheduling. The change brings a benefit in skewed situations and should not have a negative impact on performance in other scenarios.
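A toy makespan calculation for the example above (illustrative only, not Spark code; greedy list scheduling onto the next free executor):

def makespan(costs: Seq[Int], executors: Int): Int = {
  val finishTimes = Array.fill(executors)(0)
  for (c <- costs) {
    val idx = finishTimes.indexOf(finishTimes.min)  // next free executor
    finishTimes(idx) += c
  }
  finishTimes.max
}

val costs = Seq.fill(8)(1) :+ 8          // partitions 0-7 cost 1 each, partition 8 costs 8
println(makespan(costs, 2))              // serial order: 12
println(makespan(costs.sortBy(-_), 2))   // largest-input first: 8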


// Set containing all pending tasks (also used as a stack, as above).
private val allPendingTasks = new ArrayBuffer[Int]
private var allPendingTasks = new ArrayBuffer[Int]
Author

I made this a var because I don't have a better approach for sorting an ArrayBuffer in place. Advice?

Member

I don't think so, unfortunately.
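For context, a minimal sketch of the reassignment this forces (ArrayBuffer in the Scala 2.11 collections used here has no in-place sort; Scala 2.13 later added sortInPlaceBy). taskInputSize is a hypothetical lookup:

// sortBy returns a new ArrayBuffer, so the field must be a var to accept the result.
allPendingTasks = allPendingTasks.sortBy(i => taskInputSize(i))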

Member

@srowen srowen left a comment

I'm always wary of touching task scheduling and don't feel that comfortable approving it, but the idea is plausible.

t.epoch = epoch
}

val sortedPendingTasks = new AtomicBoolean(false)
Member

Can this be private?

Author

Yes, it should be.

blacklist.isExecutorBlacklistedForTaskSet(execId)
}
if (!isZombie && !offerBlacklisted) {
if (!sortedPendingTasks.get()) {
Member

I think you need if (sortedPendingTasks.compareAndSet(false, true)) or else the point of AtomicBoolean is kind of lost

Author

Yes, I should refine :)
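A short sketch of the pattern being suggested, assuming the sort should run exactly once on the first resource offer:

// compareAndSet returns true only for the caller that flips false -> true,
// so the one-time sort runs exactly once even under concurrent offers.
if (sortedPendingTasks.compareAndSet(false, true)) {
  sortPendingTasks()
}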


private[this] def sortPendingTasks(): Unit = {
val taskIndexs = (0 until numTasks).toArray
implicit def ord = new Ordering[Int] {
Member

Maybe clearer to use sortWith below and pass the ordering explicitly?

Author

Yes, I think so :)
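A sketch of the explicit-ordering variant being suggested, reusing the names from the diff (whether the comparison should be ascending or descending depends on how the pending list is consumed):

// Explicit comparison instead of an implicit Ordering[Int].
val sortedIndices = taskIndexs.sortWith { (x, y) =>
  getTaskInputSizeFromShuffledRDD(tasks(x)) < getTaskInputSizeFromShuffledRDD(tasks(y))
}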

// Visible for testing
private[spark] def setTaskInputSizeFromShuffledRDD(inputSize: Map[Task[_], Long]) = {
taskInputSizeFromShuffledRDD.clear()
inputSize.foreach{
Member

I might be missing something, but isn't this just adding all entries from one Map to another? Does ++ do this directly?

Author

Yes, it should be refined :)
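The simplification being suggested, assuming taskInputSizeFromShuffledRDD is a mutable.Map[Task[_], Long] as the diff implies:

taskInputSizeFromShuffledRDD.clear()
taskInputSizeFromShuffledRDD ++= inputSize  // copies all entries in one call, no foreach needed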

val taskIndexs = (0 until numTasks).toArray
implicit def ord = new Ordering[Int] {
override def compare(x: Int, y: Int): Int =
getTaskInputSizeFromShuffledRDD(tasks(x)) compare
Member

Go ahead and use x.compare(y) rather than omit the syntax


case Some(size) => size
case None =>
val size =
sched.dagScheduler.parentSplitsInShuffledRDD(task.stageId, task.partitionId) match {
Member

This might still be clearer as .getOrElse(..... , 0L)

Author

Yes, this should be made clearer. But sorry, I couldn't find a function like getOrElse(func, 0L).

Member

@srowen srowen Apr 5, 2017

Oh, what I really mean is ...map(parentSplits => ...).getOrElse(0L)

Author

I got it :)
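Putting the two hints together, the shape would be roughly as follows (sizeOf is a placeholder for whatever is computed from the parent splits):

val size = sched.dagScheduler
  .parentSplitsInShuffledRDD(task.stageId, task.partitionId)
  .map(parentSplits => sizeOf(parentSplits))  // sizeOf is hypothetical
  .getOrElse(0L)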

@jinxing64 jinxing64 changed the title [WIP][SPARK-20219] Schedule tasks based on size of input from ScheduledRDD [SPARK-20219] Schedule tasks based on size of input from ScheduledRDD Apr 5, 2017
@SparkQA

SparkQA commented Apr 5, 2017

Test build #75538 has finished for PR 17533 at commit e4af778.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2017

Test build #75542 has finished for PR 17533 at commit 462d92e.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2017

Test build #75543 has finished for PR 17533 at commit 97afe0a.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Contributor

I'm hesitant about this and posted some comments on the JIRA (we should try to keep high-level discussion about whether this change makes sense there, so it's easier to reference in the future and not tangled up in the low-level PR comments)

@SparkQA

SparkQA commented Apr 5, 2017

Test build #75547 has finished for PR 17533 at commit b0c3abc.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Author

@kayousterhout
Thanks a lot for the comment and sorry for the late reply. I replied to your comment on the JIRA. Please take a look when you have time :)

Contributor

@squito squito left a comment

@kayousterhout @mridulm what do you think about the refactor I suggested? Maybe that wouldn't really increase the complexity significantly?

@jinxing64 if you're really motivated, you could try it out and see how things look, though no promises yet ...

case Some(size) => size
case None =>
val size =
sched.dagScheduler.parentSplitsInShuffledRDD(task.stageId, task.partitionId).map {
Contributor

I think the major complaint is this call; we don't want the TSM requesting info from the DAGScheduler. But you could change that -- instead, the DAGScheduler could push this info into the TSM after it has the input sizes. This actually might not be that bad, since in any case the DAGScheduler has to know this info when it calls submitMissingTasks. I don't think this info should change at all after the taskset has been submitted, right? So you'd have it all available at construction time.

assert(manager.resourceOffer("exec", "host", ANY).get.index === 3)
assert(manager.resourceOffer("exec", "host", ANY).get.index === 1)
assert(manager.resourceOffer("exec", "host", ANY).get.index === 0)
}
Contributor

we'd also want a test to make sure the sizes were getting computed correctly. (I think that might be easier to do with the refactor I suggested?)

@jinxing64
Author

@squito
Thank you so much for taking look into this.

we don't want the TSM requesting info from the DAGScheduler

Sorry, I missed this point in the previous change. Now I push the info (size of input from the ShuffledRDD) when creating the TSM.
I also added a test in DAGSchedulerSuite to check that the sizes are computed correctly.
Thanks a lot again :) and I hope I understood your comment correctly.

@SparkQA

SparkQA commented Apr 9, 2017

Test build #75634 has finished for PR 17533 at commit dbacfc2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64 jinxing64 changed the title [SPARK-20219] Schedule tasks based on size of input from ScheduledRDD [WIP][SPARK-20219] Schedule tasks based on size of input from ScheduledRDD Apr 10, 2017
@SparkQA

SparkQA commented Apr 11, 2017

Test build #75695 has started for PR 17533 at commit fd9bc68.

@SparkQA

SparkQA commented Apr 11, 2017

Test build #75697 has finished for PR 17533 at commit e3a15c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@squito squito left a comment

I think this is a lot more complicated than it needs to be. You should be able to simplify significantly by looking at what the code does for the "map-stage jobs", and how those MapStatistics are used later -- I left a couple of inline comments hinting at that, though I didn't figure out all the details.

fwiw, I'm no longer opposed to this for complicating the relationship between the DAGScheduler & TSM. This version maintains the current separation. Still, I do think that in its current form this is introducing too much complexity. If it can be simplified a lot, then I might be more OK with it.

}

// Visible for testing.
private[spark] def getTaskInputSizesFromShuffledRDD(tasks: Seq[Task[_]]): Map[Task[_], Long] = {
Contributor

It doesn't look like this needs to be exposed at all for tests. (And if it were used in tests, it could probably be a little tighter as private[scheduler].)

Author

Yes, should be refined. :)

val noPartitionerConflict = rdd.partitioner match {
case Some(partitioner) =>
partitioner.isInstanceOf[HashPartitioner] &&
dep.partitioner.isInstanceOf[HashPartitioner] &&
Contributor

(a) I don't really understand what is going on here. Why would rdd.partitioner ever be different from one of the shuffle dependencies' partitioners? I thought shuffle dependencies always have to have the same partitioner?
(If there is a good reason, this probably needs a comment in the code.)

(b) If this is needed, it can probably just be partitioner == dep.partitioner -- that is simpler, equivalent for HashPartitioner, and allows it to still work for other partitioners as well.

Author

Yes, I also assumed rdd.partitioner should be the same as the shuffle dependency's partitioner, but I found CustomShuffledRDD is a different case.


/**
* Get ancestor splits in ShuffledRDD.
*/
Contributor

@squito squito Apr 12, 2017

This is called both "parents" and "ancestors", which is confusing. I think it would be most accurate to call it the stage parent, and note that it requires traversing some distance up the set of RDD ancestors.

Also, I commented on this a bit below ... in general this is pretty confusing. It seems like there are a lot of cases which get ignored, and I am not certain that's always OK.

Couldn't this entire thing be replaced with

val deps = getShuffleDependencies(rdd)
val partitioner = deps.head.partitioner
// make sure the partitioner is consistent across all our shuffle dependencies.
assert(deps.forall(_.partitioner == partitioner))
val allStats = deps.map(dep => mapOutputTracker.getStatistics(dep))
// TODO sum the stats per partition and go from there

Author

Yes, this is confusing and I need to refine it.
I'm a little hesitant to use getShuffleDependencies. I need to get the total size of input from the ShuffledRDD for every child partition. After transformations like CoalescedRDD, there may not be a consistent one-to-one match between the ancestor's partition index and the child's partition index.

parentSplits.map {
case (shuffleId, splits) =>
splits.map(mapOutputTracker.getMapSizesByExecutorId(shuffleId, _)
.flatMap(_._2.map(_._2)).sum).sum
Contributor

I think you can use mapOutputTracker.getStatistics here.

It also occurs to me that in general this could use the total input size for the task, but I guess spark isn't looking at that in general yet (though it probably could, from hadoop's InputSplit.getLength()). Just something to keep in mind.

Author

It also occurs to me that in general this could use the total input size for the task, but I guess spark isn't looking at that in general yet (though it probably could, from hadoop's InputSplit.getLength()). Just something to keep in mind.

Agree :)
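A rough sketch of the getStatistics route mentioned above; MapOutputStatistics already carries per-reduce-partition byte counts, so the per-task size falls out directly (dep here stands for the stage's ShuffleDependency):

// bytesByPartitionId(i) is the total shuffle bytes destined for reduce partition i.
val stats = mapOutputTracker.getStatistics(dep)
val inputSizeForTask = stats.bytesByPartitionId(task.partitionId)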

t.epoch = epoch
}

private val sortedPendingTasks = new AtomicBoolean(false)
Contributor

I think all of this stuff w/ the delayed sorting etc. is now totally unnecessary. If you've got input sizes, then just sort the tasks when the TSM is created, which avoids a lot of complexity.

Perhaps the tasks should even just get sorted when the TaskSet is created in the first place, and then this doesn't know or care that the tasks have been sorted in any particular way.

Author

Yes, in the current change I do the ordering when creating the TaskSet, so there is no change in the TSM now. Thanks a lot for the suggestion :)
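A hedged sketch of what sorting at TaskSet-creation time could look like on the DAGScheduler side (inputSizeFor is a hypothetical helper; the constructor arguments mirror the existing submitMissingTasks call):

// Order the tasks once, largest shuffle input first; the TaskSetManager needs no changes.
val orderedTasks = tasks.sortBy(t => -inputSizeFor(t)).toArray
taskScheduler.submitTasks(
  new TaskSet(orderedTasks, stage.id, stage.latestInfo.attemptId, jobId, properties))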

@jinxing64
Author

@squito
Thank you so much for reviewing this far, and sorry for the complexity I brought in.
I tried to simplify the code according to your comments; please take another look once the tests pass. :)

@SparkQA

SparkQA commented Apr 14, 2017

Test build #75800 has finished for PR 17533 at commit b44f1df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 14, 2017

Test build #75802 has finished for PR 17533 at commit 7212089.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Author

I think the failed unit tests will be fixed by #17634 and #17603.

@jinxing64 jinxing64 changed the title [WIP][SPARK-20219] Schedule tasks based on size of input from ScheduledRDD [SPARK-20219] Schedule tasks based on size of input from ScheduledRDD Apr 19, 2017
@HyukjinKwon
Member

Hi @jinxing64, how is it going?

@jinxing64
Author

@HyukjinKwon
Sorry, I will close this for now and open another PR if there's progress.

@jinxing64 jinxing64 closed this Jun 2, 2017