Skip reading Parquet pages using Column Indexes feature of Parquet 1.11#17216
Skip reading Parquet pages using Column Indexes feature of Parquet 1.11#17216shangxinli wants to merge 6 commits intoprestodb:masterfrom
Conversation
1cb54cd to
c8b834f
Compare
c8b834f to
d12e7ff
Compare
zhenxiao
left a comment
There was a problem hiding this comment.
nice work, @shangxinli
left some comments
recently did a parquet code refactor, to speedup ParquetTupleDomainPredicate building, could you please rebase on the recent master?
| List<ColumnIndexStore> blockIndexStores = new ArrayList<>(); | ||
| for (BlockMetaData block : footerBlocks.build()) { | ||
| if (predicateMatches(parquetPredicate, block, finalDataSource, descriptorsByPath, parquetTupleDomain, failOnCorruptedParquetStatistics)) { | ||
| ColumnIndexStore ciStore = getColumnIndexStore(parquetPredicate, finalDataSource, block, descriptorsByPath, readColumnIndexFilter); |
There was a problem hiding this comment.
s/ciStore/columnIndexStore/g
There was a problem hiding this comment.
Just replaced them
| return getParquetType(prestoType, messageType, column, tableName, path); | ||
| } | ||
|
|
||
| private static ColumnIndexStore getColumnIndexStore(Predicate parquetPredicate, ParquetDataSource dataSource, BlockMetaData blockMetadata, Map<List<String>, RichColumnDescriptor> descriptorsByPath, boolean readColumnIndexFilter) |
There was a problem hiding this comment.
return Optional, instead of null?
There was a problem hiding this comment.
We talked earlier with @beinan about whether we should return optional or null for this methood. For this use case even we use optional, we still need to check .isPresent(). It is pretty much similar with null checking and not much value to do so. Here are some good explanations about null checking and Optional checking https://medium.com/javarevisited/null-check-vs-optional-are-they-same-c361d15fade3.
There was a problem hiding this comment.
get it. I think we try to use Optional as much as possible, trying to get rid of nullPointer exceptions. What do you think @beinan ?
There was a problem hiding this comment.
I am fine to replace it with Optional. But when I try it and see we still need to end up with null checking. This is because ColumnIndexStore.java that defined in Parquet repo is using null checking. So the class ParquetColumnIndexStore extends ColumnIndexStore need to have the same signature. So if we replace it with Optional, it ends up that we convert from null checking to Optional.empty() check in HDFSParquetDataSource and have to convert to Optiona.empty() to null checking in ParquetColumnIndexStore. Since we only use empty() API in Optional, as that article mentioned, null checking and empty() is not much different in this case. Thoughts?
There was a problem hiding this comment.
get it. yep, bad Presto and Parquet are having different coding styles.
how about we try Optional as much as possible in Presto code? This reminds me old days when try to implement new Parquet reader for Presto :)
There was a problem hiding this comment.
I have the same feeling, I also suggest use optional in the old PRs.
| TupleDomain<DeltaColumnHandle> effectivePredicate, | ||
| FileFormatDataSourceStats stats) | ||
| FileFormatDataSourceStats stats, | ||
| boolean readColumnIndexFilter) |
There was a problem hiding this comment.
s/readColumnIndexFilter/columnIndexFilterEnabled/g
| } | ||
|
|
||
| boolean hasColumnIndex = false; | ||
| for (ColumnChunkMetaData column : blockMetadata.getColumns()) { |
There was a problem hiding this comment.
could we rewrite this with stream api? use anyMatch?
| return this.materializedViewMissingPartitionsThreshold; | ||
| } | ||
|
|
||
| @Config("hive.parquet-use-column-index-filter") |
There was a problem hiding this comment.
how about:
hive.parquet-column-index-filter-enabled?
| valueCount += readDataPageV1(pageHeader, uncompressedPageSize, compressedPageSize, pages); | ||
| firstRowIndex = PageReader.getFirstRowIndex(dataPageCount, offsetIndex); | ||
| valueCount += readDataPageV1(pageHeader, uncompressedPageSize, compressedPageSize, firstRowIndex, pages); | ||
| ++dataPageCount; |
There was a problem hiding this comment.
s/++dataPageCount/dataPageCount = dataPageCount + 1/g
| return dataHeaderV2.getNum_values(); | ||
| } | ||
|
|
||
| private boolean hasMorePages(long valuesCountReadSoFar, int dataPageCountReadSoFar) |
There was a problem hiding this comment.
s/valuesCountReadSoFar/valuesCount/g
s/dataPageCountReadSoFar/pagesCount/g
| currentBlock = currentBlock + 1; | ||
|
|
||
| if (filter != null && readColumnIndexFilter) { | ||
| ColumnIndexStore ciStore = blockIndexStores.get(currentBlock); |
There was a problem hiding this comment.
s/ciStore/columnIndexStore/g
| OffsetIndex offsetIndex = blockIndexStores.get(currentBlock).getOffsetIndex(metadata.getPath()); | ||
| OffsetIndex filteredOffsetIndex = ColumnIndexFilterUtils.filterOffsetIndex(offsetIndex, currentGroupRowRanges, blocks.get(currentBlock).getRowCount()); | ||
| List<OffsetRange> offsetRanges = ColumnIndexFilterUtils.calculateOffsetRanges(filteredOffsetIndex, metadata, offsetIndex.getOffset(0), startingPosition); | ||
| List<ConsecutivePartList> allParts = concatRanges(offsetRanges); |
There was a problem hiding this comment.
s/allParts/offsetRanges/g
There was a problem hiding this comment.
We already have the variable offsetRanges. We concat it.
There was a problem hiding this comment.
get it. how about:
consecutiveRanges?
generally, we do not use variable abbreviation in Presto
| * Describes a list of consecutive parts to be read at once. A consecutive part may contain whole column chunks or | ||
| * only parts of them (some pages). | ||
| */ | ||
| private class ConsecutivePartList |
There was a problem hiding this comment.
s/ConsecutivePartList/PageRanges/g
There was a problem hiding this comment.
This name is from Parquet. Better to keep the same name? What do you think?
There was a problem hiding this comment.
get it. We could either add in comments above, describing the private class is from Parquet, or rename it to PageRanges
There was a problem hiding this comment.
I just replace it with PageRanges
Yes, did it. Thanks for letting me know. |
25f3e5b to
6760bb1
Compare
563a586 to
6bccc1e
Compare
zhenxiao
left a comment
There was a problem hiding this comment.
thank you, @shangxinli
mostly looks nice. some minor things
...parquet/src/main/java/com/facebook/presto/parquet/predicate/TupleDomainParquetPredicate.java
Show resolved
Hide resolved
| * values (and the related rl and dl) for the rows [20, 39] in the end of the page 0 for col2. Similarly, we have to | ||
| * skip values while reading page0 and page1 for col3. | ||
| */ | ||
| private void processValuesSync(int valuesToRead, Consumer<Void> valueConsumer) |
There was a problem hiding this comment.
am still inclined to merge the two functions. The signatures are the same. New function looks like:
private void processValues(int valuesToRead, Consumer<Void> valueConsumer, boolean indexEnabled)
| } | ||
| } | ||
|
|
||
| // Used for columns are not in this parquet file |
There was a problem hiding this comment.
we could remove this comment. Or,
for columns not in this parquet file
There was a problem hiding this comment.
We can remove the comments
| OffsetIndex offsetIndex = blockIndexStores.get(currentBlock).getOffsetIndex(metadata.getPath()); | ||
| OffsetIndex filteredOffsetIndex = ColumnIndexFilterUtils.filterOffsetIndex(offsetIndex, currentGroupRowRanges, blocks.get(currentBlock).getRowCount()); | ||
| List<OffsetRange> offsetRanges = ColumnIndexFilterUtils.calculateOffsetRanges(filteredOffsetIndex, metadata, offsetIndex.getOffset(0), startingPosition); | ||
| List<ConsecutivePartList> allParts = concatRanges(offsetRanges); |
There was a problem hiding this comment.
get it. how about:
consecutiveRanges?
generally, we do not use variable abbreviation in Presto
| * Describes a list of consecutive parts to be read at once. A consecutive part may contain whole column chunks or | ||
| * only parts of them (some pages). | ||
| */ | ||
| private class ConsecutivePartList |
There was a problem hiding this comment.
get it. We could either add in comments above, describing the private class is from Parquet, or rename it to PageRanges
94c2ffa to
e94aada
Compare
zhenxiao
left a comment
There was a problem hiding this comment.
looks nice, @shangxinli
one minor issue
could you please squash all commits into one, and add release note?
also, make sure all tests passed
There was a problem hiding this comment.
let's add message in the exception, like:
parquet type not supported: %s
beinan
left a comment
There was a problem hiding this comment.
lgtm, I think reviewed most of the code change in the previous PRs. Thank you @shangxinli ! Looking forward to seeing more contribution from parquet community! Thanks!
|
rerun the tests |
|
We don't do merge commits, please rebase your changes onto master and squash all changes into one commit. Thanks! |
ece19ac to
47bfd42
Compare
|
@rongrong That means I need to recreate a new PR? |
|
Created a new PR #17284 to rebase due to so manage conflicts. |
Test plan - (Please fill in how you tested your changes)
Please make sure your submission complies with our Development, Formatting, and Commit Message guidelines. Don't forget to follow our attribution guidelines for any code copied from other projects.
Fill in the release notes towards the bottom of the PR description.
See Release Notes Guidelines for details.
If release note is NOT required, use: