-
Notifications
You must be signed in to change notification settings - Fork 29.3k
[SPARK-12505] [SQL] Pushdown a Limit on top of an Outer-Join #10454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 3 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -47,6 +47,7 @@ object DefaultOptimizer extends Optimizer { | |
| PushPredicateThroughProject, | ||
| PushPredicateThroughGenerate, | ||
| PushPredicateThroughAggregate, | ||
| PushLimitThroughOuterJoin, | ||
| ColumnPruning, | ||
| // Operator combine | ||
| ProjectCollapsing, | ||
|
|
@@ -857,6 +858,30 @@ object PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper { | |
| } | ||
| } | ||
|
|
||
| /** | ||
| * Push [[Limit]] operators through [[Join]] operators, iff the join type is outer joins. | ||
| * Adding extra [[Limit]] operators on top of the outer-side child/children. | ||
| */ | ||
| object PushLimitThroughOuterJoin extends Rule[LogicalPlan] with PredicateHelper { | ||
|
|
||
| def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
| case f @ Limit(expr, Join(left, right, joinType, joinCondition)) => | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We also can push it down through the |
||
| joinType match { | ||
| case RightOuter => | ||
| Limit(expr, Join(left, CombineLimits(Limit(expr, right)), joinType, joinCondition)) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. need a stop condition to stop pushing
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for your review! Since we call
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ah, it's right, nvm
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you!
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That is right. However I think checking if it is already pushed would reduce unnecessary multiple applying this rule and |
||
| case LeftOuter => | ||
| Limit(expr, Join(CombineLimits(Limit(expr, left)), right, joinType, joinCondition)) | ||
| case FullOuter => | ||
| Limit(expr, | ||
| Join( | ||
| CombineLimits(Limit(expr, left)), | ||
| CombineLimits(Limit(expr, right)), | ||
| joinType, joinCondition)) | ||
| case _ => f // DO Nothing for the other join types | ||
| } | ||
| } | ||
| } | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it possible that we add an extra limit in every iteration ?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You are right, but, after this rule, there exists another rule |
||
|
|
||
| /** | ||
| * Removes [[Cast Casts]] that are unnecessary because the input is already the correct type. | ||
| */ | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we will generate wrong results. It is not safe to push down the limit if we are not joining on the foreign key with the primary key, right? For example, for a left outer join, we push down the limit to the right table. It is possible all rows returned by the right side are having the same join column value.
am I missing anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For left outer joins, we only push down the limit to the left table. Thus, I think it should be safe?
Basically, the rule is to add additional Limit node(s) on top of the outer-side child/children
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, sorry. I was looking at the wrong line. For full outer join, is it safe? Also, with the limit, the result of left/right outer may not be deterministic, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. Actually, for left/right outer join, what will happen if we have
A left outer join B on (A.key = B.key) sort by A.key limit 10?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a sort, the plan is different. We will have a
SortbelowLimit. Thus, it is not applicable to this rule.For the full-outer join, I think the result is still correct, but the result might not be the same as the original plan.
In the new plan, we add extra
limit. If the nodes belowlimitis deterministic, the result should be deterministic. If not deterministic, the result will be not deterministic. Right? Thus, this rule will not change the results' deterministic property. Sorry, this part is unclear to me. I am not sure my answer is correct.Based on my understanding, when we use
limit, the users will not expect a deterministic results, unless they useorder by.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I found a hole.
full outer = union (left outer, right outer). We are unable to add extralimitbelowUnion, which removes the duplicates.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, we have the a table A having 1, 2, 3, 4, 5 (say k is the column name). If we do
A x FULL OUTER JOIN A y ON (x.k = y.k) limit 2and we push limit to both side, it is possible that we get1,2from the left side and3, 4from the right side, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me update the code and fix the bug. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the
full outerjoin, my idea is to add extralimitto one side which has a higherstatistics. Does that sound good?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to refine the idea.
For the
full outerjoin,limit. Do nothing.limit, add extralimitto that side.limit, add extralimitto one side which has a higherstatisticsIs it better?