Support Parquet TupleDomain using ColumnDescriptor #6892
nezihyigitbasi merged 1 commit into prestodb:master
|
@nezihyigitbasi @dain this is a first step to support nested predicate pushdown for Parquet in Presto. This PR is to make Parquet TupleDomain based on Parquet ColumnDescriptor, so that when nested predicates are pushed down, it could easily work |
0dba438 to eef5773
|
@dain @nezihyigitbasi kindly ping |
eef5773 to fb3cf7a
2dfa193 to 470f87a
b3aec2e to b8e275f
|
@dain @nezihyigitbasi kindly ping |
b8e275f to 71e17e6
|
@dain @nezihyigitbasi could you please review when you are free? |
|
@nezihyigitbasi @dain kindly ping |
|
@nezihyigitbasi @dain kindly ping |
|
In an offline discussion we decided that @zhenxiao was going to investigate using synthetic virtual-columns in the connector to enable this push-down feature. |
|
Hi @dain: Could you please review this first? |
|
Ah ok. Will do. |
|
@dain @nezihyigitbasi kindly ping |
|
@zhenxiao I will take a look at this shortly. |
nezihyigitbasi left a comment
@zhenxiao I did a first pass, back to you.
Thank you, resolved.
Now that typeManager is no longer used in this method, we can remove it.
static import Map.Entry
This class already has access to the column descriptor so we shouldn't need to pass the Presto type as well since Presto type can be derived from that (like you do in getPrestoType()). So I guess we can move getPrestoType() logic to this class.
Got it. This class is not needed at all. I will delete it and use RichColumnDescriptor directly.
@param statistics column statistics
I don't think you need the paths variable, you can just pass Arrays.asList(columnMetaData.getPath().toArray()) below.
Although not entirely related to this patch, why are we ignoring the IOException here? If we get an exception we will silently skip populating dictionaries for that particular column.
If we hit any problem while reading the dictionary page, we silently skip it and do not use that dictionary to try to skip row groups.
OK as long as we are not messing up while we read.
What will happen if the descriptor is not present and we return an empty map of dictionaries?
We only build dictionaries when the descriptor exists. Otherwise the dictionaries map is empty, the dictionary predicate does not apply, and we scan the file.
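The fallback described in the two replies above can be sketched roughly as follows. This is only an illustration under stated assumptions: `DictionaryReader`, `collectDictionaries`, and `canSkipRowGroup` are hypothetical names, not the Presto or Parquet API, and dictionaries are reduced to lists of strings.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DictionarySketch
{
    // Hypothetical reader interface; not a Presto or Parquet API.
    interface DictionaryReader
    {
        List<String> readDictionary(String column) throws IOException;
    }

    // Collects a dictionary for every column that can be read. A column whose
    // dictionary page fails to load is simply left out of the result.
    static Map<String, List<String>> collectDictionaries(List<String> columns, DictionaryReader reader)
    {
        Map<String, List<String>> dictionaries = new HashMap<>();
        for (String column : columns) {
            try {
                dictionaries.put(column, reader.readDictionary(column));
            }
            catch (IOException e) {
                // Deliberately ignored: without a dictionary we lose an
                // optimization for this column, not correctness.
            }
        }
        return dictionaries;
    }

    // A row group may be skipped only when a dictionary is available and it
    // proves that the wanted value cannot occur. With no dictionary, this
    // returns false and the row group is scanned.
    static boolean canSkipRowGroup(Map<String, List<String>> dictionaries, String column, String value)
    {
        List<String> dictionary = dictionaries.get(column);
        return dictionary != null && !dictionary.contains(value);
    }
}
```

The point of the sketch is that a missing dictionary can only make the reader do more work, never return wrong rows, which matches the "OK as long as we are not messing up while we read" condition.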
getDomains() returns an Optional; are we sure it's non-empty? If so, we should assert that (com.google.common.base.Verify::verify).
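A minimal sketch of that assertion, kept free of external dependencies: the local `verify` helper below is a stand-in for Guava's `Verify.verify` (which throws `VerifyException` rather than `IllegalStateException`), `requireDomains` is a hypothetical name, and the domain map is simplified to `Map<String, String>`. The message text is the one suggested later in this review.

```java
import java.util.Map;
import java.util.Optional;

final class VerifySketch
{
    // Local stand-in for a verify-style check. Guava's Verify.verify throws
    // VerifyException; this sketch uses IllegalStateException instead.
    static void verify(boolean condition, String message)
    {
        if (!condition) {
            throw new IllegalStateException(message);
        }
    }

    // Hypothetical helper: fail loudly if the Optional is empty instead of
    // calling get() and hoping for the best.
    static Map<String, String> requireDomains(Optional<Map<String, String>> domains)
    {
        verify(domains.isPresent(), "parquetTupleDomain is empty");
        return domains.get();
    }
}
```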
throw new PrestoException(NOT_SUPPORTED, "Unsupported parquet type: " + descriptor.getType());
71e17e6 to bbed115
|
thank you @nezihyigitbasi |
nezihyigitbasi left a comment
LGTM except some minor comments. BTW do we need similar changes in the new reader for nested predicate pushdown support?
Now that we have this method in ParquetTypeUtils, we can update ParquetColumnReader to call it as well instead of maintaining two copies.
Yep, updated.
"parquetTupleDomain is empty"
bbed115 to 5eda135
|
Thank you @nezihyigitbasi, comments addressed. |
|
@dain I think this looks good. |
dain left a comment
Just one minor comment. Otherwise, Nezih, merge whenever you want.
Can we rename i and j to something more descriptive? Maybe columnIndex and level (and rename level to maxLevel)?
5eda135 to ff5cd08
|
thank you @dain @nezihyigitbasi |
|
thanks @zhenxiao I will merge this once the tests all pass. |
|
Hi @zhenxiao , |
|
Hi @shurvitz thanks for reaching out. |
|
Got it. Thank you, @shurvitz @pettyjamesm
Currently the Parquet TupleDomain is constructed based on HiveColumnHandle. This would not work if nested predicates are pushed down, e.g.
This patch constructs the Parquet TupleDomain from Parquet's ColumnDescriptor, so that it can work with nested predicate pushdown.
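To illustrate why keying the domain by a leaf column helps, here is a rough, self-contained sketch. It is not the code in this patch: `Range`, `leaf`, and `canSkipRowGroup` are hypothetical names, a domain is simplified to a closed range of longs, and a leaf column is identified only by its path (for example `["address", "zip"]`), which is the part of a column descriptor that a top-level column handle cannot express.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NestedDomainSketch
{
    // Hypothetical simplification of a domain: a closed range [low, high].
    static final class Range
    {
        final long low;
        final long high;

        Range(long low, long high)
        {
            this.low = low;
            this.high = high;
        }

        boolean overlaps(Range other)
        {
            return low <= other.high && other.low <= high;
        }
    }

    // Leaf columns are identified by their full path from the schema root.
    // List<String> has value-based equals/hashCode, so it works as a map key.
    static List<String> leaf(String... path)
    {
        return Arrays.asList(path);
    }

    // A row group can be skipped when, for some constrained leaf column, its
    // min/max statistics do not overlap the range required by the predicate.
    static boolean canSkipRowGroup(Map<List<String>, Range> predicate, Map<List<String>, Range> statistics)
    {
        for (Map.Entry<List<String>, Range> entry : predicate.entrySet()) {
            Range stats = statistics.get(entry.getKey());
            if (stats != null && !stats.overlaps(entry.getValue())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args)
    {
        // Predicate on a nested field: address.zip BETWEEN 94000 AND 94999.
        Map<List<String>, Range> predicate = new HashMap<>();
        predicate.put(leaf("address", "zip"), new Range(94000, 94999));

        // Per-leaf statistics for one row group.
        Map<List<String>, Range> statistics = new HashMap<>();
        statistics.put(leaf("address", "zip"), new Range(10000, 19999));
        statistics.put(leaf("id"), new Range(1, 500));

        System.out.println(canSkipRowGroup(predicate, statistics));
    }
}
```

With a map keyed only by the top-level column (`address`), there would be no entry to compare against the statistics of `address.zip`, so the predicate could not be used to skip the row group.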