[SPARK-19355][SQL] Use map output statistics to improve global limit's parallelism #16677

viirya · 2017-01-23T08:54:28Z

What changes were proposed in this pull request?

A logical Limit is performed physically by two operations LocalLimit and GlobalLimit.

Most of the time, we gather all data into a single partition in order to run GlobalLimit. With a very large limit, shuffling that data causes performance problems and also reduces parallelism.

We can avoid shuffling into a single partition if we don't care about data ordering. This patch implements this idea by running a map stage during global limit, which collects the number of rows in each partition. Each partition then retrieves its share of the limited rows locally, without any shuffling, to finish the global limit.

For example, suppose we have three partitions with (100, 100, 50) rows respectively. For a global limit of 100 rows, we may take (34, 33, 33) rows from each partition locally. After the global limit we still have three partitions.
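The even split in this example can be sketched in plain Scala. This is a minimal illustration, not the code in the patch: `FlatLimit` and `takeEvenly` are hypothetical names, and the rule for handing out the remainder (the first partitions get one extra row) is an assumption chosen to reproduce the numbers above.

```scala
object FlatLimit {
  /** Split `limit` as evenly as possible across partitions,
    * never taking more rows than a partition actually holds. */
  def takeEvenly(rowCounts: Seq[Long], limit: Long): Seq[Long] = {
    val taken = Array.fill(rowCounts.length)(0L)
    var remaining = math.min(limit, rowCounts.sum)
    while (remaining > 0) {
      // Partitions that still have rows left to give.
      val open = rowCounts.indices.filter(i => taken(i) < rowCounts(i))
      val share = remaining / open.length
      // The first `extra` open partitions take one more row (assumed tie-break).
      val extra = (remaining % open.length).toInt
      open.zipWithIndex.foreach { case (i, rank) =>
        val want = share + (if (rank < extra) 1 else 0)
        val got = math.min(want, rowCounts(i) - taken(i))
        taken(i) += got
        remaining -= got
      }
    }
    taken.toSeq
  }
}
```

Calling `FlatLimit.takeEvenly(Seq(100L, 100L, 50L), 100L)` yields `(34, 33, 33)`. If one partition is too small to supply its share, the loop redistributes the shortfall over the partitions that still have rows.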

If the data partitions have a certain ordering, we can't distribute the required rows evenly across partitions because that could change the data ordering. But we can still avoid shuffling.
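For the ordered case, one way to keep the ordering is to fill the limit from the partitions in sequence. The sketch below assumes that reading of the description; `SequentialLimit` and `takeInOrder` are hypothetical names, not identifiers from the patch.

```scala
object SequentialLimit {
  /** Take rows partition by partition, in order, until `limit` rows are reached. */
  def takeInOrder(rowCounts: Seq[Long], limit: Long): Seq[Long] = {
    var remaining = math.max(limit, 0L)
    rowCounts.map { count =>
      val take = math.min(count, remaining)
      remaining -= take
      take
    }
  }
}
```

With row counts `(100, 100, 50)`, a limit of 100 takes `(100, 0, 0)` and a limit of 120 takes `(100, 20, 0)`. Later partitions contribute only after earlier ones are exhausted, so the relative order of the rows is preserved and no row needs to move between partitions.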

How was this patch tested?

Jenkins tests.

viirya · 2017-01-23T08:55:39Z

cc @rxin @wzhfy @scwf

SparkQA · 2017-01-23T13:08:46Z

Test build #71833 has finished for PR 16677 at commit e067b10.

This patch fails from timeout after a configured wait of `250m`.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FakePartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning
case class LocalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode with CodegenSupport
case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode

viirya · 2017-01-23T14:26:10Z

retest this please.

SparkQA · 2017-01-23T17:19:54Z

Test build #71842 has finished for PR 16677 at commit 5dae2da.

This patch fails from timeout after a configured wait of `250m`.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FakePartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning
case class LocalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode with CodegenSupport
case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode

SparkQA · 2017-01-23T17:39:58Z

Test build #71856 has finished for PR 16677 at commit 0205cd9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FakePartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning
case class LocalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode with CodegenSupport
case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode

SparkQA · 2017-01-23T18:38:44Z

Test build #71848 has finished for PR 16677 at commit 5dae2da.

This patch fails from timeout after a configured wait of `250m`.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FakePartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning
case class LocalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode with CodegenSupport
case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode

SparkQA · 2017-01-24T04:06:26Z

Test build #71901 has finished for PR 16677 at commit 3dec117.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FakePartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning
case class LocalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode with CodegenSupport
case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode

SparkQA · 2017-01-24T05:54:41Z

Test build #71905 has finished for PR 16677 at commit 0a2e96f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FakePartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning
case class LocalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode with CodegenSupport
case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode

SparkQA · 2017-01-24T14:15:34Z

Test build #71931 has finished for PR 16677 at commit 4fb5e40.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FakePartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning
case class LocalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode with CodegenSupport
case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode

SparkQA · 2017-01-24T18:07:59Z

Test build #71936 has finished for PR 16677 at commit 9d4cadb.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FakePartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning
case class LocalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode with CodegenSupport
case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode

viirya · 2017-01-25T00:52:39Z

also cc @cloud-fan and @hvanhovell

scwf · 2017-01-25T02:49:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

how about LocalPartitioning

scwf · 2017-01-25T02:50:52Z

sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala

i think here we should use the shuffle rdd to directly read the data from disk.

scwf · 2017-01-25T02:51:33Z

sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala

getOrElse(empty iter)?

Actually we won't reach here, but the change is ok.

scwf · 2017-01-25T02:53:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala

it's better to broadcast reduceAmounts

yeah, i thought about it before but forgot to add it.

scwf · 2017-01-25T02:58:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala

I think we can move the logic of constructing the shuffled RDD to ShuffleExchange, and in global limit we begin with the shuffled RDD.

add a new Distribution for fake partitioning

modify the ShuffledRowRDD to carry the row num of each partition

I am more conservative about this: currently no other operators use this feature, so I would rather not change ShuffleExchange right now.

SparkQA · 2017-01-25T03:53:36Z

Test build #71955 has finished for PR 16677 at commit 7f89c30.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FakePartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning
case class LocalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode with CodegenSupport
case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode

SparkQA · 2017-01-25T05:37:47Z

Test build #71972 has started for PR 16677 at commit def10e6.

SparkQA · 2017-01-25T05:57:39Z

Test build #71973 has started for PR 16677 at commit 4e31bb7.

hvanhovell · 2018-08-10T09:31:22Z

Merging to master. Thanks!

viirya · 2018-08-10T18:20:51Z

Thank you! @hvanhovell

cloud-fan · 2018-08-25T08:00:49Z

sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-limit.sql

 -- A test suite for IN LIMIT in parent side, subquery, and both predicate subquery
 -- It includes correlated cases.

+-- Disable global limit optimization


do we have a problem here?

This disables the optimization to get the limited values exactly the same as the current golden results.

ah i see, thanks!

hvanhovell · 2018-08-25T18:21:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala

+    // partitions. If disabled, scanning data partitions sequentially until reaching limit number.
+    // Besides, if child output has certain ordering, we can't evenly pick up rows from
+    // each parititon.
+    val flatGlobalLimit = sqlContext.conf.limitFlatGlobalLimit && child.outputOrdering == Nil


@viirya dumb question, what is child.outputOrdering doing here? I am not entirely sure that we should guarantee that you should get the lowest elements of a dataset if you perform a limit in the middle of a query (a top level sort-limit does have this guarantee). I also don't think the SQL standard supports/mandates this.

Moreover checking child.outputOrdering only checks the order of the partition and not the order of the frame as a whole. You should also add the child.outputPartitioning.

I would be slightly in favor of removing the child.outputOrdering check.

If we remove it, we may need to feature flag it first since people may rely on the old behavior. Anyway all of this is up for debate.

For a query like select * from table order by a limit 10, I think the expected semantics is to return the top 10 elements, not any 10 elements. I added this check in order not to change that behavior.

> Moreover checking child.outputOrdering only checks the order of the partition and not the order of the frame as a whole. You should also add the child.outputPartitioning.

I think you are correct. We need to check child.outputPartitioning, specifically whether there is a RangePartitioning. The check should be that the child has a range partitioning and some output ordering. WDYT?

> I am not entirely sure that we should guarantee that you should get the lowest elements of a dataset if you perform a limit in the middle of a query (a top level sort-limit does have this guarantee). I also don't think the SQL standard supports/mandates this.
> I would be slightly in favor of removing the child.outputOrdering check.

For a limit in the middle of a query, I am not sure we can ignore this. When such a query has a sort, don't we need to return the top limit elements?

cc @cloud-fan too.

select * from table order by a limit 10 gets planned differently right? It should use TakeOrderedAndProjectExec.

There is nothing in the SQL standard that mandates that a nested order by followed by a limit has to respect that ordering clause. In fact, AFAIR, the standard does not even support nested limits (they make stuff non-deterministic).

If we end up supporting this, then I'd rather have an explicit flag in GlobalLimitExec (orderedLimit or something like that) and set that during planning by matching on Limit(limit, Sort(order, true, child)). I want the explicit flag because then we can figure out what limit is doing by looking at the physical plan. I want to explicitly check for an underlying sort to match the current TakeOrderedAndProjectExec semantics and to avoid weird behavior because something way down the plan has set some arbitrary ordering.
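The planning idea in this comment can be sketched with a toy plan model. The case classes below are stand-ins written for this illustration, not Catalyst's classes, and `GlobalLimitSketch` with its `orderedLimit` field is a hypothetical node that only mirrors the flag name suggested above.

```scala
// Toy logical plan nodes (stand-ins for illustration, not Catalyst's classes).
sealed trait LogicalPlan
case class Relation(name: String) extends LogicalPlan
case class Sort(order: Seq[String], global: Boolean, child: LogicalPlan) extends LogicalPlan
case class Limit(limit: Int, child: LogicalPlan) extends LogicalPlan

// Hypothetical physical node that carries the explicit flag.
case class GlobalLimitSketch(limit: Int, orderedLimit: Boolean, child: LogicalPlan)

object LimitPlanning {
  /** Set the flag only when the limit sits directly on top of a global sort. */
  def plan(p: LogicalPlan): Option[GlobalLimitSketch] = p match {
    case Limit(n, sort @ Sort(_, true, _)) =>
      Some(GlobalLimitSketch(n, orderedLimit = true, sort))
    case Limit(n, child) =>
      Some(GlobalLimitSketch(n, orderedLimit = false, child))
    case _ => None
  }
}
```

Because the flag is decided by an explicit pattern match on the immediate child, an ordering that happens to exist somewhere deeper in the plan does not change how the limit behaves, and the chosen behavior is visible on the physical node.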

Ok, I got your point. Since the SQL standard doesn't mandate that, I think we can safely remove the child.outputOrdering check.

Let me open a follow up PR for it.

rxin · 2018-09-18T22:24:40Z

core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala

+  /**
+   * The number of outputs for the map task.
+   */
+  def numberOfOutput: Long


what does this mean? output blocks? output files?

rxin · 2018-09-18T22:30:27Z

two questions about this (i just saw this from a different place):

is numOutput about number of records?
how much memory usage will be increased by, for the driver, at scale?

hvanhovell · 2018-09-18T22:33:15Z

numOutputs is the number of records
8 bytes per MapStatus.
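To put the 8-byte figure in perspective, here is a back-of-the-envelope calculation. It assumes the driver keeps one MapStatus per map task and counts only the extra `Long` field, ignoring any JVM object overhead; `MapStatusOverhead` is a hypothetical helper, and the task count in the usage note is an illustrative number, not a measurement.

```scala
object MapStatusOverhead {
  // One extra Long (8 bytes) per MapStatus, as stated in the reply above.
  val bytesPerMapStatus: Long = 8L

  /** Extra driver memory in bytes, assuming one MapStatus per map task. */
  def extraDriverBytes(numMapTasks: Long): Long = numMapTasks * bytesPerMapStatus
}
```

For example, `MapStatusOverhead.extraDriverBytes(1000000L)` is 8,000,000 bytes, or roughly 8 MB for a million map tasks.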

rxin · 2018-09-18T23:52:15Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala

-    checkAnswer(
-      limit2Df.groupBy("id").count().select($"id"),
-      limit2Df.select($"id"))
+    withSQLConf(SQLConf.LIMIT_FLAT_GLOBAL_LIMIT.key -> "true") {


why do we set this flag here? we need to document it.

rxin · 2018-09-18T23:52:24Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruningSuite.scala

  override def beforeAll(): Unit = {
    super.beforeAll()
    TestHive.setCacheTables(false)
+    TestHive.setConf(SQLConf.LIMIT_FLAT_GLOBAL_LIMIT, false)


why do we set this flag here? we need to document it.

viirya · 2018-09-18T23:58:20Z

@rxin Thanks for the comment. I will improve the documentation in a PR.

rxin · 2018-09-18T23:59:26Z

sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala

+          }
+        }
+      }
+      val broadMap = sparkContext.broadcast(takeAmountByPartition)


does "broad" here mean broadcast? if yes, i don't think we have this convention in spark ...

btw why do we need to broadcast this?

Because we want the map to be sent to each node only once?

Also let me change the variable name when improving the documentation.

we also broadcast closures automatically, don't we? so just putting a variable in a closure would accomplish this.

broadcast is more efficient if data size is big, because of TorrentBroadcast. What's our expectation of the data size here?

The size depends on the number of partitions. Each partition uses an int. If this is too small, we can remove the broadcast.

but tasks are already broadcasted

rxin · 2018-09-19T00:17:45Z

actually looking at the design - this could cause perf regressions in some cases too right? it introduces a barrier that was previously non-existent. if the number of records to take isn't substantially less than the actual records on each partition, perf would be much worse. also it feels to me this isn't shuffle at all, and we are piggybacking on the wrong infrastructure. what you really want is a way to buffer blocks temporarily, and can launch a 2nd wave of tasks to rerun some of them.

viirya · 2018-09-19T00:51:43Z

I'm not sure where it can cause perf regressions. Basically this just changes the way we retrieve records from partitions when performing limit. It doesn't shuffle them together into a single partition.

cloud-fan · 2018-09-19T02:48:41Z

Let me take an example from the PR description

> For example, we have three partitions with rows (100, 100, 50) respectively. In global limit of 100 rows, we may take (34, 33, 33) rows for each partition locally. After global limit we still have three partitions.

Without this patch, we need to take the first 100 rows from each partition, and then perform a shuffle to send all data into one partition and take the first 100 rows.

So if the limit is big, this patch is super useful; if the limit is small, this patch is not that useful but should not be slower.

The only overhead I can think of is that MapStatus needs to carry the numRecords metric. It should be a small overhead, as MapStatus already carries a lot of information.
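The cost of the pre-patch plan described above can be made concrete with a small calculation: each partition applies a local limit, and everything that survives is sent to one partition. `LimitShuffleCost` and its method are hypothetical names used only for this sketch.

```scala
object LimitShuffleCost {
  /** Rows funneled into the single partition by the pre-patch plan:
    * each partition contributes at most `limit` rows after its local limit. */
  def rowsShuffledToSinglePartition(rowCounts: Seq[Long], limit: Long): Long =
    rowCounts.map(count => math.min(count, limit)).sum
}
```

For partitions of `(100, 100, 50)` rows and a limit of 100, this is 100 + 100 + 50 = 250 rows sent to one partition, of which only 100 are kept. The gap grows with the limit and the number of partitions, which is why the benefit is expected mainly for large limits.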

rxin · 2018-09-19T04:45:09Z

ok after thinking about it more, i think we should just revert all of these changes and go back to the drawing board. here's why:

1. the prs change some of the most common/core parts of spark, and are not properly designed (as in they haven't gone through actual discussions; there's not even a doc on how they work). the prs created a much more complicated implementation for limit / top k. you might be able to justify the complexity with the perf improvements, but we better write them down, discuss them, and make sure they are the right design choices. we also need to explain the execution strategies for limit in comments. this is just a comment about the process, not the actual design.
2. now onto the design, i am having issues with two major parts:

   2a. what this pr really wanted was an abstraction to buffer data, and then have the driver analyze some statistics about data (records per map task), and then make decisions. because spark doesn't yet have that infrastructure, this pr just adds some hacks to shuffle to make it work. there is no proper abstraction here.

   2b. i'm not even sure if the algorithm here is the right one. the pr tries to parallelize as much as possible by keeping the same number of tasks. imo a simpler design that would work for more common cases is to buffer the data, get the records per map task, and create a new rdd with the first N number of partitions that reach limit. that way, we don't launch too many tasks, and we retain ordering.

3. the pr implementation quality is poor. variable names are confusing (output vs records); it's severely lacking documentation; the doc for the config option is arcane.
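The alternative in 2b, keeping only the first N partitions whose combined record count reaches the limit, might look like the following once the per-map-task record counts are known. This is a sketch of that suggestion under my reading of it; `PrefixPartitions` and `partitionsNeeded` are hypothetical names.

```scala
object PrefixPartitions {
  /** Number of leading partitions whose combined row count first reaches `limit`
    * (all partitions if the limit is never reached). */
  def partitionsNeeded(rowsPerMapTask: Seq[Long], limit: Long): Int =
    if (limit <= 0) 0
    else {
      // Running totals after each partition, e.g. (100, 100, 50) -> (100, 200, 250).
      val runningTotals = rowsPerMapTask.scanLeft(0L)(_ + _).tail
      val idx = runningTotals.indexWhere(_ >= limit)
      if (idx >= 0) idx + 1 else rowsPerMapTask.length
    }
}
```

With counts `(100, 100, 50)`, a limit of 100 needs only the first partition and a limit of 150 needs the first two. Only those leading partitions would feed the new RDD, so fewer tasks are launched and the original partition order is retained.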

sorry about all of the above, but we gotta do better.

cloud-fan · 2018-09-19T04:58:33Z

I'm convinced, there are 2 major issues:

1. abusing shuffle. we need a new mechanism for driver to analyze some statistics about data (records per map task)
2. too many small tasks. We need a better algorithm to decide the parallelism of limit.

viirya · 2018-09-19T07:47:17Z

I understand the two major concerns regarding this change. I'm going to submit a PR to revert the change. I will look into this idea further with a new design.

## What changes were proposed in this pull request?

This goes to revert sequential PRs based on some discussion and comments at #16677 (comment).

#22344
#22330
#22239
#16677

## How was this patch tested?

Existing tests.

Closes #22481 from viirya/revert-SPARK-19355-1.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 89671a2)
Signed-off-by: Wenchen Fan <[email protected]>

sujith71955 · 2018-10-10T09:34:47Z

@viirya Are we also looking to optimize the CollectLimitExec part? I saw that SparkPlan has an executeTake() method which basically interpolates the number of partitions to scan and processes the limit query. If the driver analyzes some statistics about the data, I think we can optimize this algorithm as well, right?

viirya · 2018-10-10T10:08:10Z

@sujith71955 For executeTake, to optimize it we need to collect statistics of the RDD. executeTake incrementally scans partitions. Ideally, it should scan just a few partitions to return n rows, and the remaining partitions can be skipped and don't need to be materialized. So going back to the beginning, IMHO, if we are going to collect the statistics, we will materialize all partitions, and that seems to be the opposite of executeTake's optimization.
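The incremental scan described here can be illustrated over in-memory data. This is a deliberately simplified sketch, not Spark's `executeTake`: it grows each batch by a fixed factor and leaves out the interpolation step mentioned earlier in the thread. `IncrementalTake`, `take`, and the `scaleUpFactor` default of 4 are all assumptions made for the illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object IncrementalTake {
  /** Scan partitions in growing batches until `n` rows are collected.
    * Returns the rows and the number of partitions that were scanned. */
  def take[T](partitions: IndexedSeq[Seq[T]], n: Int, scaleUpFactor: Int = 4): (Seq[T], Int) = {
    require(scaleUpFactor > 1, "scaleUpFactor must be greater than 1")
    val buf = ArrayBuffer.empty[T]
    var scanned = 0
    var batch = 1 // start by trying a single partition
    while (buf.size < n && scanned < partitions.length) {
      val end = math.min(scanned + batch, partitions.length)
      // In a real engine this round would be one job over partitions [scanned, end).
      for (p <- scanned until end; row <- partitions(p) if buf.size < n) buf += row
      scanned = end
      batch = scanned * scaleUpFactor // try a larger batch next round
    }
    (buf.toSeq, scanned)
  }
}
```

When the first partition already holds enough rows, only one partition is scanned and the rest are never touched. Collecting per-partition statistics up front would require materializing every partition, which is the tension described in the comment above.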

sujith71955 · 2018-10-10T10:23:02Z

@viirya I have a use case where a normal query takes around 5 seconds, whereas the same query with limit 5000 takes around 17 seconds. When I checked, I found the bottleneck in the above-mentioned flow.

sujith71955 · 2018-10-10T10:24:20Z

Mainly i think because we are trying to interpolate the number of partitions

viirya · 2018-10-10T11:44:50Z

@sujith71955 Thanks, I see. This case is somewhat different from the problem this PR wants to solve, but I think it is a reasonable use case. Would you like to create a ticket so we can track it?

sujith71955 · 2018-10-10T11:52:01Z

Yes, sure, I will create a ticket for this issue and keep you guys in the loop. Thanks.

viirya force-pushed the improve-global-limit-parallelism branch from e067b10 to 5dae2da Compare January 23, 2017 13:02

viirya force-pushed the improve-global-limit-parallelism branch from 5dae2da to 0205cd9 Compare January 23, 2017 15:39

viirya force-pushed the improve-global-limit-parallelism branch 2 times, most recently from 3dec117 to 0a2e96f Compare January 24, 2017 03:55

viirya force-pushed the improve-global-limit-parallelism branch from 0a2e96f to 4fb5e40 Compare January 24, 2017 11:57

viirya force-pushed the improve-global-limit-parallelism branch from 4fb5e40 to 9d4cadb Compare January 24, 2017 15:33

viirya changed the title ~~[WIP][SQL] Use map output statistices to improve global limit's parallelism~~ [SPARK-19355][SQL] Use map output statistices to improve global limit's parallelism Jan 25, 2017

viirya force-pushed the improve-global-limit-parallelism branch from 9d4cadb to 7f89c30 Compare January 25, 2017 00:59

viirya force-pushed the improve-global-limit-parallelism branch from 7f89c30 to def10e6 Compare January 25, 2017 05:34

Use map output statistices to improve global limit's parallelism.

4e31bb7

viirya force-pushed the improve-global-limit-parallelism branch from def10e6 to 4e31bb7 Compare January 25, 2017 05:56

asfgit closed this in 4f17585 Aug 10, 2018

viirya mentioned this pull request Aug 27, 2018

[SPARK-19355][SQL][Followup] Remove the child.outputOrdering check in global limit #22239

Closed

This was referenced Sep 19, 2018

Revert [SPARK-19355][SPARK-25352] #22464

Closed

Revert [SPARK-19355][SPARK-25352] #22481

Closed

viirya deleted the improve-global-limit-parallelism branch December 27, 2023 18:21
