[SPARK-10895][SPARK-11164][SQL] Push down InSet and string filters to Parquet #8956
|
Test build #43149 has finished for PR 8956 at commit
|
|
cc @liancheng |
|
Test build #43213 has finished for PR 8956 at commit
|
|
retest this please. |
|
Test build #43215 has finished for PR 8956 at commit
|
|
Test build #43238 has finished for PR 8956 at commit
|
Nit: remove ()
|
Off topic but related,
    private val min = valueSet.min
    private val max = valueSet.max
    override def canDrop(statistics: Statistics[T]): Boolean = {
      statistics.getMax.compareTo(min) < 0 || max.compareTo(statistics.getMin) < 0
    }
(And we probably should rename |
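For readers following the suggestion above, here is a minimal, self-contained sketch of the idea: compute the set's minimum and maximum once, then drop any row group whose statistics range cannot overlap it. `ColumnStats` and `InSetPredicate` are hypothetical stand-ins invented for illustration; they are not the classes in this patch or in Parquet's filter API.

```scala
// Hypothetical stand-in for per-row-group column statistics (not Parquet's class).
final case class ColumnStats[T](min: T, max: T)

// Hypothetical set-membership predicate that can skip whole row groups.
final class InSetPredicate[T](valueSet: Set[T])(implicit ord: Ordering[T]) {
  require(valueSet.nonEmpty, "valueSet must not be empty")

  // Computed once at construction, not once per row group.
  private val min: T = valueSet.min
  private val max: T = valueSet.max

  // Row-level check: keep the row only if its value is in the set.
  def keep(value: T): Boolean = valueSet.contains(value)

  // Row-group check: safe to drop when the group's [min, max] range
  // lies entirely below or entirely above the set's range.
  def canDrop(stats: ColumnStats[T]): Boolean =
    ord.lt(stats.max, min) || ord.lt(max, stats.min)
}
```

Note that the check is conservative: a row group whose range overlaps the set's range is kept even if none of its values are actually in the set, and the row-level `keep` then does the exact filtering.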
|
@liancheng Thank you for your detailed comments. I've updated this patch. Once the tests pass, please review it again to see if there are any problems. |
|
Test build #43878 has finished for PR 8956 at commit
|
|
retest this please. |
|
Test build #43880 has finished for PR 8956 at commit
|
|
Are there any performance improvements by pushing this down? |
|
I can run some performance tests later. |
|
Thanks - that'd be great. |
|
Sorry, I am traveling. I will submit the test results in a few days. |
|
@viirya do you mind closing this and reopening it when it's ready? |
|
Sure. |
|
@rxin I am curious about one thing. A simple experiment with a projection + filter query shows no significant performance improvement so far. But by pushing these filters down to the Parquet side, do we retrieve less data and reduce the memory footprint? If so, is this patch still worth merging even at the same performance level? |
|
If we don't observe performance improvements, it's definitely not worth it. Can you post how you measured it, and the performance results? Thanks. |
|
OK, thanks. We found that with pushdown filters we can avoid OOM problems when processing large data in our daily usage, so I am wondering if it would be helpful to others too. I will post the performance test later. |
|
How does pushdown avoid OOM? |
|
Because we can pre-filter the data? Without pushdown, the whole dataset is loaded into memory and only filtered afterwards. |
|
Is that the case? I thought we load them one by one (or small batch at a time) and then apply the filter directly on them? |
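The general idea behind the question above can be sketched with a plain Scala `Iterator`: when rows are produced lazily and the filter is applied as they stream through, the full dataset is never held in memory at once. This only illustrates the concept; the `rows` source and the `produced` counter are hypothetical and do not reflect Spark's actual Parquet read path.

```scala
object StreamingFilterSketch {
  // Counts how many rows the hypothetical source has actually produced.
  var produced: Int = 0

  // Hypothetical lazy row source: each row is created only when requested.
  def rows(n: Int): Iterator[Int] =
    Iterator.tabulate(n) { i => produced += 1; i }

  // Filter applied row by row; nothing is read until a result is demanded.
  def firstMatches(n: Int, count: Int): List[Int] =
    rows(n).filter(_ % 1000 == 0).take(count).toList
}
```

Asking for only the first few matches pulls only as many rows from the source as are needed to find them, rather than materializing all `n` rows first.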
|
Hmm, I am not sure about that. I assumed that the Parquet relation reads all data first if no pushdown filters are applied. Then Spark SQL's |
|
Well, it depends. The situation is a little bit tricky to explain. In general there are two cases:
|
|
Thank you @liancheng for the clear explanation! So it looks like the only benefit of this patch is the reduced memory footprint in certain cases. If you all think it is not worth merging, we should keep it closed. |
|
@liancheng ok. Thank you. |
JIRA: https://issues.apache.org/jira/browse/SPARK-10895