SPARK-3711: Optimize where in clause filter queries #2561
|
Can one of the admins verify this patch? |
|
ok to test |
|
Test FAILed. |
|
QA tests have started for PR 2561 at commit
|
Add a comment that this is an optimized version of In for when all values of the inList are static.
|
Nice optimization :) A few minor suggestions only. |
|
QA tests have finished for PR 2561 at commit
|
…se to Optimizer.scala by adding a rule. Add appropriate comments
|
Updated the branch by incorporating review comments.
|
|
QA tests have started for PR 2561 at commit
|
|
Test FAILed. |
|
QA tests have finished for PR 2561 at commit
|
|
Test PASSed. |
Why special handling for UnaryMinus? If the underlying thing is a literal this will get constant folded away.
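To illustrate the point about constant folding, here is a minimal sketch of how a negated literal collapses into a plain literal before any later rule looks at it. The `FoldSketch` object and its `Expr`, `Literal`, `Attribute`, and `UnaryMinus` types are simplified stand-ins written for this example, not Spark's actual Catalyst classes, and `foldConstants` is a hypothetical helper.

```scala
object FoldSketch {
  // Simplified stand-ins for expression nodes (assumption: integers only).
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class Attribute(name: String) extends Expr
  case class UnaryMinus(child: Expr) extends Expr

  // Fold UnaryMinus over a literal into a single literal; leave anything else alone.
  def foldConstants(e: Expr): Expr = e match {
    case UnaryMinus(child) =>
      foldConstants(child) match {
        case Literal(v) => Literal(-v)       // -(5) becomes the literal -5
        case other      => UnaryMinus(other) // cannot fold a column reference
      }
    case other => other
  }
}
```

Under these assumptions, a rule that runs after folding only ever sees `Literal(-5)`, so it needs no separate case for `UnaryMinus`.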
|
Another thing is there should be some tests for this. Probably in |
…Suite
2. Add class OptimizedInSuite on the lines of ConstantFoldingSuite, for the optimized In clause
|
QA tests have started for PR 2561 at commit
|
|
|
QA tests have finished for PR 2561 at commit
|
|
Test PASSed. |
This should actually only be indented 2 spaces (and I'd include a blank line after). Only wrapped arguments are indented 4.
We should also double check that there isn't a problem if a null is in the in list. I'm not sure if that's actually valid SQL (and it should never change the result), but we shouldn't throw an exception.
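As a rough illustration of why a null in the list should not change which rows a filter keeps, the sketch below models standard SQL three-valued logic for `IN`, using `Option` to represent a nullable value. `ThreeValuedIn`, `sqlIn`, and `keepRow` are hypothetical names for this example and say nothing about how Spark's `In` or `InSet` are implemented; the list is assumed to be non-empty.

```scala
object ThreeValuedIn {
  // None models SQL NULL; Some(true) / Some(false) model TRUE / FALSE.
  def sqlIn(value: Option[Int], list: Seq[Option[Int]]): Option[Boolean] =
    value match {
      case None => None                         // NULL IN (...) is unknown
      case Some(v) =>
        if (list.contains(Some(v))) Some(true)  // a real match wins
        else if (list.contains(None)) None      // no match, but a NULL makes it unknown
        else Some(false)                        // no match and no NULL
    }

  // A WHERE clause keeps a row only when the predicate is TRUE.
  def keepRow(predicate: Option[Boolean]): Boolean = predicate.contains(true)
}
```

With this model, adding a null to the list can turn a FALSE into an unknown, but never into a TRUE, so the set of rows that pass the filter stays the same.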
|
QA tests have started for PR 2561 at commit
|
|
|
QA tests have finished for PR 2561 at commit
|
|
Test FAILed. |
2. Fix optimization condition
3. Add tests for null in filter list
4. Add test case that optimization is not triggered in case of attributes in filter list
|
QA tests have started for PR 2561 at commit
|
|
QA tests have finished for PR 2561 at commit
|
|
Test PASSed. |
This should probably be OptimizeIn.
|
Thanks for implementing this! :) I'm going to make the two final small changes myself before merging. Just wanted to comment for future reference. |
The In case class is replaced by an InSet class when all the filters are literals. InSet uses a hash set instead of a sequence, which gives a significant performance improvement (previously the sequence was checked with a worst-case linear match via the exists method, since expressions were assumed in the filter list). The largest improvement should be visible when a small percentage of a large data set matches the filter list.
Author: Yash Datta <[email protected]>
Closes #2561 from saucam/branch-1.1 and squashes the following commits:
4bf2d19 [Yash Datta] SPARK-3711: 1. Fix code style and import order 2. Fix optimization condition 3. Add tests for null in filter list 4. Add test case that optimization is not triggered in case of attributes in filter list
afedbcd [Yash Datta] SPARK-3711: 1. Add test cases for InSet class in ExpressionEvaluationSuite 2. Add class OptimizedInSuite on the lines of ConstantFoldingSuite, for the optimized In clause
0fc902f [Yash Datta] SPARK-3711: UnaryMinus will be handled by constantFolding
bd84c67 [Yash Datta] SPARK-3711: Incorporate review comments. Move optimization of In clause to Optimizer.scala by adding a rule. Add appropriate comments
430f5d1 [Yash Datta] SPARK-3711: Optimize the filter list in case of negative values as well
bee98aa [Yash Datta] SPARK-3711: Optimize where in clause filter queries
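The rewrite described in the commit message can be sketched roughly as follows. This is a simplified, self-contained model and not the merged Spark code: `Expr`, `Literal`, `Attribute`, `In`, `InSet`, and `OptimizeIn` here are stand-ins that borrow the names used in this thread, and rows are modelled as a plain `Map[String, Any]` purely for illustration.

```scala
// Simplified stand-ins for Catalyst expressions (assumption: not the real Spark classes).
sealed trait Expr { def eval(row: Map[String, Any]): Any }

case class Literal(value: Any) extends Expr {
  def eval(row: Map[String, Any]): Any = value
}

case class Attribute(name: String) extends Expr {
  def eval(row: Map[String, Any]): Any = row(name)
}

// General form: list items may be arbitrary expressions, so every row scans the list.
case class In(value: Expr, list: Seq[Expr]) extends Expr {
  def eval(row: Map[String, Any]): Any = {
    val v = value.eval(row)
    list.exists(e => e.eval(row) == v) // worst case: linear in the list size
  }
}

// Optimized form: all values are known up front and stored in a set.
case class InSet(value: Expr, hset: Set[Any]) extends Expr {
  def eval(row: Map[String, Any]): Any = hset.contains(value.eval(row))
}

// Rewrite In to InSet only when every list item is a literal.
object OptimizeIn {
  def apply(e: Expr): Expr = e match {
    case In(v, list) if list.forall(_.isInstanceOf[Literal]) =>
      InSet(v, list.map(_.eval(Map.empty)).toSet)
    case other => other
  }
}
```

In this sketch, `OptimizeIn(In(Attribute("a"), Seq(Literal(1), Literal(3))))` yields an `InSet`, while a list containing an `Attribute` is left as a plain `In`, which mirrors the test case mentioned above for attributes in the filter list.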
|
Oh, also, in the future please open pull requests against master not a specific branch. We'll back port them as needed. |
|
Thanks a lot for all the help with the merge :) Will surely keep all the pointers in mind for the next time :) |