[SPARK-19355][SQL] Use map output statistics to improve global limit's parallelism #16677
Conversation
Test build #71833 has finished for PR 16677.
retest this please.
Test build #71842 has finished for PR 16677.

Test build #71856 has finished for PR 16677.

Test build #71848 has finished for PR 16677.

Test build #71901 has finished for PR 16677.

Test build #71905 has finished for PR 16677.

Test build #71931 has finished for PR 16677.

Test build #71936 has finished for PR 16677.
also cc @cloud-fan and @hvanhovell
How about LocalPartitioning?
I think here we should use the shuffle RDD to directly read the data from disk.
Oh, right.
getOrElse(empty iter)?
Actually we won't reach here, but the change is ok.
It's better to broadcast reduceAmounts.
Yeah, I thought about it before but forgot to add it.
I think we can move the logic of constructing the shuffled RDD into ShuffleExchange, so that the global limit begins with the shuffled RDD (a rough sketch follows this list):
- add a new Distribution for fake partitioning
- modify ShuffledRowRDD to carry the row count of each partition
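A self-contained toy sketch of that suggestion; the names LocalPartitioning, ShuffledRowRDDWithCounts, and recordsPerPartition are illustrative, not Spark's actual API:

```scala
// Toy model of the suggestion above; all names are hypothetical.
trait Partitioning { def numPartitions: Int }

// A "fake" partitioning: map output i is read back by reducer i, so the
// shuffle machinery is reused but no rows move between partitions.
case class LocalPartitioning(numPartitions: Int) extends Partitioning

// A shuffled row RDD that additionally carries the row count of each
// partition, populated from the map output statistics of the shuffle stage.
class ShuffledRowRDDWithCounts(val recordsPerPartition: Array[Long]) {
  def numPartitions: Int = recordsPerPartition.length
}
```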
For this, I am more conservative: currently there are no other operators using this feature, so I would tend not to change ShuffleExchange right now.
Test build #71955 has finished for PR 16677.
Test build #71972 has started for PR 16677.
Test build #71973 has started for PR 16677.
Merging to master. Thanks!
Thank you! @hvanhovell
```sql
-- A test suite for IN LIMIT in parent side, subquery, and both predicate subquery
-- It includes correlated cases.

-- Disable global limit optimization
```
Do we have a problem here?
This disables the optimization so that the limited values are exactly the same as the current golden results.
Ah, I see, thanks!
```scala
// partitions. If disabled, we scan data partitions sequentially until reaching the limit number.
// Besides, if the child output has a certain ordering, we can't evenly pick up rows from
// each partition.
val flatGlobalLimit = sqlContext.conf.limitFlatGlobalLimit && child.outputOrdering == Nil
```
@viirya Dumb question: what is child.outputOrdering doing here? I am not entirely sure that we should guarantee that you get the lowest elements of a dataset if you perform a limit in the middle of a query (a top-level sort-limit does have this guarantee). I also don't think the SQL standard supports/mandates this.

Moreover, checking child.outputOrdering only checks the order within each partition and not the order of the frame as a whole. You should also check child.outputPartitioning.

I would be slightly in favor of removing the child.outputOrdering check.
If we remove it, we may need to feature-flag it first since people may rely on the old behavior. Anyway, all of this is up for debate.
For a query like select * from table order by a limit 10, I think the expected semantics is to return the top 10 elements, not any 10 elements. In order not to change this behavior, I added this check.

> Moreover, checking child.outputOrdering only checks the order within each partition and not the order of the frame as a whole. You should also check child.outputPartitioning.

I think you are correct. We need to check child.outputPartitioning. The check should be that the child has a RangePartitioning and some output ordering. WDYT?

> I am not entirely sure that we should guarantee that you get the lowest elements of a dataset if you perform a limit in the middle of a query (a top-level sort-limit does have this guarantee). I also don't think the SQL standard supports/mandates this. I would be slightly in favor of removing the child.outputOrdering check.

I am not sure that for a limit in the middle of a query we don't need to consider this. When such a query has a sort, don't we need to return the top limit elements?

cc @cloud-fan too.
select * from table order by a limit 10 gets planned differently, right? It should use TakeOrderedAndProjectExec.

There is nothing in the SQL standard that mandates that a nested order by followed by a limit has to respect that ordering clause. In fact, AFAIR, the standard does not even support nested limits (they make things non-deterministic).

If we end up supporting this, then I'd rather have an explicit flag in GlobalLimitExec (orderedLimit or something like that) and set it during planning by matching on Limit(limit, Sort(order, true, child)). I want the explicit flag because then we can figure out what the limit is doing by looking at the physical plan, and I want to explicitly check for an underlying sort to match the current TakeOrderedAndProjectExec semantics and to avoid weird behavior because something way down the plan has set some arbitrary ordering.
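To illustrate, here is a self-contained toy model of that planning rule; orderedLimit and the simplified plan classes below are hypothetical, not GlobalLimitExec's actual signature:

```scala
// Minimal stand-ins for the logical plan nodes involved.
sealed trait LogicalPlan
case class Relation(name: String) extends LogicalPlan
case class Sort(global: Boolean, child: LogicalPlan) extends LogicalPlan
case class Limit(limit: Int, child: LogicalPlan) extends LogicalPlan

// Physical limit carrying the explicit ordering flag proposed above.
case class GlobalLimitExec(limit: Int, child: LogicalPlan, orderedLimit: Boolean)

def planLimit(p: LogicalPlan): Option[GlobalLimitExec] = p match {
  // A limit directly on top of a global sort must preserve that order;
  // marking it explicitly makes the semantics visible in the physical plan.
  case Limit(n, s @ Sort(true, _)) => Some(GlobalLimitExec(n, s, orderedLimit = true))
  // Any other limit is free to pick rows from partitions in any order.
  case Limit(n, child) => Some(GlobalLimitExec(n, child, orderedLimit = false))
  case _ => None
}
```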
OK, I got your point. As the SQL standard doesn't mandate that, I think we can safely remove the child.outputPartitioning check.

Let me open a follow-up PR for it.
```scala
/**
 * The number of outputs for the map task.
 */
def numberOfOutput: Long
```
What does this mean? Output blocks? Output files?
Two questions about this (I just saw this from a different place):
```scala
checkAnswer(
  limit2Df.groupBy("id").count().select($"id"),
  limit2Df.select($"id"))

withSQLConf(SQLConf.LIMIT_FLAT_GLOBAL_LIMIT.key -> "true") {
```
Why do we set this flag here? We need to document it.
```scala
override def beforeAll(): Unit = {
  super.beforeAll()
  TestHive.setCacheTables(false)
  TestHive.setConf(SQLConf.LIMIT_FLAT_GLOBAL_LIMIT, false)
```
Why do we set this flag here? We need to document it.
@rxin Thanks for the comment. I will improve the documentation in a PR.
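For reference, toggling the flag from user code might look like the sketch below; the conf key string "spark.sql.limit.flatGlobalLimit" is an assumption here and should be verified against SQLConf:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("limit-flag").getOrCreate()
// Assumed key for SQLConf.LIMIT_FLAT_GLOBAL_LIMIT; verify before relying on it.
spark.conf.set("spark.sql.limit.flatGlobalLimit", "false")
// With the flag off, global limit keeps the old single-partition behavior.
spark.range(1000).limit(100).collect()
```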
```scala
    }
  }
}
val broadMap = sparkContext.broadcast(takeAmountByPartition)
```
does "broad" here means broadcast? if yes, i don't think we have this convention in spark ...
BTW, why do we need to broadcast this?
Because we want the map to be sent to each node only once?
Also, let me change the variable name when improving the documentation.
We also broadcast closures automatically, don't we? So just putting a variable in a closure would accomplish this.
Broadcast is more efficient if the data size is big, because of TorrentBroadcast. What's our expectation of the data size here?
The size depends on the number of partitions. Each partition uses an int. If this is too small, we can remove the broadcast.
But tasks are already broadcast.
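To make the trade-off concrete, here is a hedged sketch of what the broadcast is used for; takeAmounts and the function name are illustrative:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// takeAmounts(i) = rows to keep from partition i, computed on the driver
// from map output statistics. Broadcasting ships the array to each executor
// once (via TorrentBroadcast) instead of inside every task closure; for an
// array of ints this only matters when there are very many partitions.
def takePerPartition(sc: SparkContext, rows: RDD[Int], takeAmounts: Array[Int]): RDD[Int] = {
  val broadcastTakes = sc.broadcast(takeAmounts)
  rows.mapPartitionsWithIndex { (index, iter) =>
    iter.take(broadcastTakes.value(index))
  }
}
```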
Actually, looking at the design, this could cause perf regressions in some cases too, right? It introduces a barrier that was previously non-existent. If the number of records to take isn't substantially less than the actual records in each partition, perf would be much worse. Also, it feels to me this isn't a shuffle at all, and we are piggybacking on the wrong infrastructure. What you really want is a way to buffer blocks temporarily and to launch a second wave of tasks to rerun some of them.
I'm not sure where it can cause perf regressions. Basically this just changes the way we retrieve records from partitions when performing a limit; it doesn't shuffle them together into a single partition.
Let me take an example from the PR description.

Without this patch, we need to take the first 100 rows from each partition, and then perform a shuffle to send all the data into one partition and take the first 100 rows there. So if the limit is big, this patch is super useful; if the limit is small, this patch is not that useful, but it should not be slower. The only overhead I can think of is,
OK, after thinking about it more, I think we should just revert all of these changes and go back to the drawing board. Here's why:

Sorry about all of the above, but we gotta do better.
I'm convinced; there are two major issues:
I understood the two major concerns regarding this change. I'm going to submit a PR to revert the change. I will look into this idea further with a new design.
What changes were proposed in this pull request?

This reverts sequential PRs based on some discussion and comments at #16677 (comment): #22344 #22330 #22239 #16677

How was this patch tested?

Existing tests.

Closes #22481 from viirya/revert-SPARK-19355-1.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 89671a2)
Signed-off-by: Wenchen Fan <[email protected]>
@viirya Are we also looking to optimize the CollectLimitExec part? I saw that in SparkPlan we have an executeTake() method which basically interpolates the number of partitions to scan and processes the limit query. If the driver analyzed some statistics about the data, I think we could optimize even this algorithm, right?
@sujith71955 For
@viirya I have a use case where a normal query takes around 5 seconds while the same query with limit 5000 takes around 17 seconds. When I was checking, I found the bottleneck in the above-mentioned flow.
Mainly, I think, because we are trying to interpolate the number of partitions.
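For context, a simplified, self-contained model of that strategy: executeTake scans a few partitions first and grows the batch each round by a scale-up factor (configurable in real Spark; this toy version is illustrative, not the actual implementation):

```scala
// Toy model of an executeTake-style iterative scan over partitions.
def takeIteratively[T](partitions: Seq[Seq[T]], n: Int, scaleUpFactor: Int = 4): Seq[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var scanned = 0 // partitions consumed so far
  var batch = 1   // partitions to scan in the next round
  while (buf.length < n && scanned < partitions.length) {
    partitions.slice(scanned, scanned + batch).foreach { p =>
      buf ++= p.take(n - buf.length)
    }
    scanned += batch
    batch *= scaleUpFactor // scan exponentially more partitions each round
  }
  buf.toSeq.take(n)
}
```

Each round launches another job, which is why a large limit over many small partitions can need several rounds and noticeably more wall-clock time than the query without the limit.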
@sujith71955 Thanks, I see. The case is somewhat different from the problem this PR wants to solve, but I think it is a reasonable use case. Would you like to create a ticket for us to track it?
Yes, sure. I will create a ticket for this issue and keep you in the loop. Thanks.
What changes were proposed in this pull request?
A logical Limit is performed physically by two operations, LocalLimit and GlobalLimit. Most of the time, we gather all data into a single partition in order to run GlobalLimit. If the limit number is very big, shuffling the data causes performance issues and also reduces parallelism.

We can avoid shuffling into a single partition if we don't care about data ordering. This patch implements that idea by running a map stage during the global limit: it collects the number of rows in each partition, and then each partition locally retrieves its share of the limited rows, with no shuffle needed to finish the global limit.

For example, suppose we have three partitions with (100, 100, 50) rows respectively. For a global limit of 100 rows, we may take (34, 33, 33) rows from each partition locally. After the global limit we still have three partitions.

If the data has a certain ordering, we can't distribute the required rows evenly across partitions because that could change the ordering, but we can still avoid the shuffle.
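A minimal sketch of the row-assignment step described above, assuming per-partition row counts obtained from map output statistics; the real implementation may distribute rows differently:

```scala
// Decide how many rows each partition keeps locally for a global limit.
def assignTakes(rowCounts: Array[Long], limit: Long): Array[Long] = {
  val takes = Array.fill(rowCounts.length)(0L)
  var remaining = math.min(limit, rowCounts.sum) // can't take more rows than exist
  // Hand out one row per partition per round, skipping exhausted partitions;
  // for counts (100, 100, 50) and limit 100 this yields (34, 33, 33).
  while (remaining > 0) {
    for (i <- rowCounts.indices if remaining > 0 && takes(i) < rowCounts(i)) {
      takes(i) += 1
      remaining -= 1
    }
  }
  takes
}
```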
How was this patch tested?
Jenkins tests.