-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-54972][SQL] Improve NOT IN subqueries with non-nullable columns #53733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-54972][SQL] Improve NOT IN subqueries with non-nullable columns #53733
Conversation
JIRA Issue Information=== Improvement SPARK-54972 === This comment was automatically generated by GitHub Actions |
| // positive not in subquery case | ||
| var joinExec = assertJoin(( | ||
| "select * from testData where key not in (select a from testData2)", | ||
| "select * from testData where key not in (select b from testData3)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testData2 columns are not nullable, but we need a nullable column to keep this test valid (assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)).
|
cc @cloud-fan , @dongjoon-hyun |
11e8cf1 to
6961295
Compare
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
Thank you for pinging me, @peter-toth . The PR itself looks reasonable to me. Could you simply re-trigger the CI failure because it looks irrelevant to this?
- SPARK-47148: AQE should avoid to submit shuffle job on cancellation *** FAILED *** (6 seconds, 125 milliseconds)
|
Thanks @dongjoon-hyun for the review. Merged to |
### What changes were proposed in this pull request? Run `NullPropagation` after NOT IN subquery rewrite. ### Why are the changes needed? NOT IN subqueries like `SELECT * FROM t1 WHERE c NOT IN (SELECT c FROM t2)` are rewritten as left anti join `t1.c = t2.c` with additional `OR IsNull(t1.c = t2.c)` conditions which prevents equi join implementations to be used so those joins end up as `BroadcastNestedLoopJoin`. When we know the columns can't be null, we can either drop those additional conditions during subquery rewrite or call `NullPropagation` after the rewrite to simplify them to `false`. This PR contains the latter. Please note that apache#29104 already optmized the single column NOT IN subqueries from `BroadcastNestedLoopJoin` to "null aware" `BroadcastHashJoin` very well, but when the columns are not nullable we can optimize multi column cases as well and the join don't need to be "null aware". ### Does this PR introduce _any_ user-facing change? Yes, performance improvement. ### How was this patch tested? A new UTs was added and some exsisting tests were adjusted to keep their validity. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53733 from peter-toth/SPARK-54972-improve-not-in-with-non-nullables. Authored-by: Peter Toth <[email protected]> Signed-off-by: Peter Toth <[email protected]>
What changes were proposed in this pull request?
Run
NullPropagationafter NOT IN subquery rewrite.Why are the changes needed?
NOT IN subqueries like
SELECT * FROM t1 WHERE c NOT IN (SELECT c FROM t2)are rewritten as left anti joint1.c = t2.cwith additionalOR IsNull(t1.c = t2.c)conditions which prevents equi join implementations to be used so those joins end up asBroadcastNestedLoopJoin. When we know the columns can't be null, we can either drop those additional conditions during subquery rewrite or callNullPropagationafter the rewrite to simplify them tofalse. This PR contains the latter.Please note that #29104 already optmized the single column NOT IN subqueries from
BroadcastNestedLoopJointo "null aware"BroadcastHashJoinvery well, but when the columns are not nullable we can optimize multi column cases as well and the join don't need to be "null aware".Does this PR introduce any user-facing change?
Yes, performance improvement.
How was this patch tested?
A new UTs was added and some exsisting tests were adjusted to keep their validity.
Was this patch authored or co-authored using generative AI tooling?
No.