Skip reading Parquet pages using Column Indexes feature. #17284
beinan merged 1 commit into prestodb:master
c87451b to
78295ee
beinan
left a comment
Thank you @shangxinli for your great work and the rebase! Just two minor variable-naming issues; I guess you might have missed these two places during the rebase.
If I change it to 'setColumnIndexFilterEnabled', it always causes some odd test failures that seem unrelated. I changed it to setColumnIndexFilter.
It seems that even changing it to setColumnIndexFilterEnabled still doesn't work either. Let's just keep the original name for now.
zhenxiao
left a comment
@shangxinli nice work.
A few more comments.
We will speed up merging this PR.
s/canDropCanWithRangeStats/canDropWithRangeStatistics/g
return columnDomain.intersect(domain).isNone();
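For readers following the suggestion above, here is a minimal sketch of why an empty intersection means the data can be dropped. `LongRange` and `RangeStatisticsSketch` are hypothetical stand-ins invented for illustration; Presto's real `Domain` type covers far more than closed integer intervals, and this is not the code in the PR.

```java
// Hypothetical stand-in for a value domain: a closed interval [min, max] of longs.
// It only illustrates the "intersection is empty" test, nothing more.
final class LongRange
{
    final long min;
    final long max;

    LongRange(long min, long max)
    {
        if (min > max) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.min = min;
        this.max = max;
    }

    // True when the two intervals share no value, i.e. their intersection is empty.
    boolean isDisjointFrom(LongRange other)
    {
        return max < other.min || other.max < min;
    }
}

final class RangeStatisticsSketch
{
    private RangeStatisticsSketch() {}

    // A page (or column chunk) can be dropped when its [min, max] statistics
    // cannot contain any value that the predicate accepts.
    static boolean canDropWithRangeStatistics(LongRange statistics, LongRange predicate)
    {
        return statistics.isDisjointFrom(predicate);
    }
}
```

With statistics of [0, 10], a predicate of [11, 20] allows dropping, while a predicate of [10, 20] does not, because the value 10 is shared.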
s/ciConversions/conversions/g
currentRow = currentRow + 1;
presto-parquet/src/main/java/com/facebook/presto/parquet/reader/AbstractColumnReader.java
presto-parquet/src/main/java/com/facebook/presto/parquet/reader/PageReader.java
Could we move pageIndex = pageIndex + 1 out of the function call?
The comment is not useful. Let's remove it.
Should we consider merging PageRanges with OffsetRange into one class? They are quite similar.
They look the same, but we would have to convert between int and long.
Yes, shall we merge the two classes?
The variable 'length' is an int in PageRanges but a long in OffsetRanges. If we merge, we need to cast, and we should generally avoid casting as much as we can.
I am inclined to merge PageRanges into OffsetRange. We are casting OffsetRange length to int in code below. Doing the merge could remove duplicate code, and save the cast, too.
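To make the trade-off in this thread concrete, here is a rough sketch of what a single merged range class could look like. It assumes the range keeps `long` fields and narrows to `int` in exactly one place. The class name `OffsetRangeSketch` and its methods are hypothetical and are not taken from the PR.

```java
// Hypothetical merged range type: offsets and lengths are kept as long,
// and narrowing to int happens in one clearly named method.
final class OffsetRangeSketch
{
    private final long offset;
    private final long length;

    OffsetRangeSketch(long offset, long length)
    {
        if (offset < 0 || length < 0) {
            throw new IllegalArgumentException("offset and length must be non-negative");
        }
        this.offset = offset;
        this.length = length;
    }

    long getOffset()
    {
        return offset;
    }

    long getLength()
    {
        return length;
    }

    // First position after the range.
    long getEndOffset()
    {
        return offset + length;
    }

    // Math.toIntExact throws ArithmeticException on overflow instead of
    // silently truncating the way a plain (int) cast would.
    int lengthAsInt()
    {
        return Math.toIntExact(length);
    }
}
```

Callers that need an `int`, such as code sizing a buffer, would call `lengthAsInt()` instead of scattering `(int)` casts across the reader.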
s/rangeStartPos/startPosition/g
zhenxiao
left a comment
@shangxinli mostly good
a few remaining minor things
presto-parquet/src/main/java/com/facebook/presto/parquet/reader/AbstractColumnReader.java
presto-parquet/src/main/java/com/facebook/presto/parquet/reader/PageReader.java
d3e1180 to
f8d63e2
zhenxiao
left a comment
Hi @shangxinli, mostly looks good.
Only 2 minor issues remain; everything else looks good to me.
we are casting OffsetRange length to int, shall we use OffsetRange directly?
Addressed the last two comments from @zhenxiao.
b5164a7 to
cde2032
looks good to me
Yeah, I did. But the test keeps failing with the error "didn't finish within the time-out 60000". I reverted it, but it still failed. It shouldn't be related to the change itself. I will add back the last commit and squash.
ce3b0e4 to
0a98e6d
Port some code from parquet-mr repo https://github.com/apache/parquet-mr
Co-authored-by: Gabor Szadovszky <gabor.szadovszky@cloudera.com>
More details about Parquet Column Indexes feature:
Column Indexes, also known as page-level indexes, hold min/max values for each page in a given column chunk. When reading pages, a reader doesn't need to process the page header to determine whether the page could be skipped based on the statistics. More information about this feature can be found at https://github.com/apache/parquet-format/blob/master/PageIndex.md
0a98e6d to
58db92d
Does this break the build? Looks like a real version of junit, maybe something wrong @ Central?
hmmm, very likely, let me remove the real junit version. Just posted a PR to fix it #17334 @aweisberg
Port some code from parquet-mr repo https://github.com/apache/parquet-mr
Co-authored-by: Gabor Szadovszky gabor.szadovszky@cloudera.com
More details about Parquet Column Indexes feature:
Column Indexes, also known as page-level indexes, hold min/max values for each page in a given column chunk. When reading pages, a reader doesn't need to process the page header to determine whether the page could be skipped based on the statistics. More information about this feature can be found at https://github.com/apache/parquet-format/blob/master/PageIndex.md
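As a rough illustration of the idea described above (not the code ported in this PR), the sketch below selects the pages worth reading from per-page min/max values alone, without touching any page header. `ColumnIndexSketch`, `pagesToRead`, and the plain `long[]` arrays are hypothetical simplifications; the real column index also tracks null pages, boundary order, and typed binary values.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnIndexSketch
{
    private ColumnIndexSketch() {}

    // Returns the indexes of the pages whose [min, max] range overlaps the
    // predicate range [lower, upper]. All other pages can be skipped.
    static List<Integer> pagesToRead(long[] pageMins, long[] pageMaxes, long lower, long upper)
    {
        if (pageMins.length != pageMaxes.length) {
            throw new IllegalArgumentException("min and max arrays must have the same length");
        }
        List<Integer> selected = new ArrayList<>();
        for (int page = 0; page < pageMins.length; page++) {
            // Two closed intervals overlap when each one starts before the other ends.
            boolean overlaps = pageMaxes[page] >= lower && pageMins[page] <= upper;
            if (overlaps) {
                selected.add(page);
            }
        }
        return selected;
    }
}
```

For three pages covering [0, 99], [100, 199], and [200, 299], a predicate of `value BETWEEN 150 AND 160` selects only the second page, so the other two are never read or decoded.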
Test plan - (Please fill in how you tested your changes)
This feature was tested in the Uber staging environment and then rolled out to production for 5+ months.
Fill in the release notes towards the bottom of the PR description.
See Release Notes Guidelines for details.