Skip reading Parquet pages using Column Indexes feature of Parquet 1.11 #17216

Closed

shangxinli wants to merge 6 commits into prestodb:master from shangxinli:column_indexes_dev_new

Conversation

@shangxinli
Collaborator

Test plan - (Please fill in how you tested your changes)

Please make sure your submission complies with our Development, Formatting, and Commit Message guidelines. Don't forget to follow our attribution guidelines for any code copied from other projects.

Fill in the release notes towards the bottom of the PR description.
See Release Notes Guidelines for details.

== RELEASE NOTES ==

General Changes
* ...
* ...

Hive Changes
* ...
* ...

If release note is NOT required, use:

== NO RELEASE NOTE ==

@shangxinli force-pushed the column_indexes_dev_new branch 2 times, most recently from 1cb54cd to c8b834f on January 23, 2022 05:44
@shangxinli force-pushed the column_indexes_dev_new branch from c8b834f to d12e7ff on January 23, 2022 18:16
Collaborator

@zhenxiao left a comment

Nice work, @shangxinli.
Left some comments.
I recently did a Parquet code refactor to speed up ParquetTupleDomainPredicate building; could you please rebase on the recent master?

List<ColumnIndexStore> blockIndexStores = new ArrayList<>();
for (BlockMetaData block : footerBlocks.build()) {
if (predicateMatches(parquetPredicate, block, finalDataSource, descriptorsByPath, parquetTupleDomain, failOnCorruptedParquetStatistics)) {
ColumnIndexStore ciStore = getColumnIndexStore(parquetPredicate, finalDataSource, block, descriptorsByPath, readColumnIndexFilter);
Collaborator

s/ciStore/columnIndexStore/g

Collaborator Author

Just replaced them

return getParquetType(prestoType, messageType, column, tableName, path);
}

private static ColumnIndexStore getColumnIndexStore(Predicate parquetPredicate, ParquetDataSource dataSource, BlockMetaData blockMetadata, Map<List<String>, RichColumnDescriptor> descriptorsByPath, boolean readColumnIndexFilter)
Collaborator

return Optional, instead of null?

Collaborator Author

We talked earlier with @beinan about whether we should return Optional or null from this method. For this use case, even if we use Optional, we still need to check .isPresent(). That is pretty much the same as null checking, so there is not much value in doing so. Here is a good explanation of null checking vs. Optional checking: https://medium.com/javarevisited/null-check-vs-optional-are-they-same-c361d15fade3.

Collaborator

Got it. I think we try to use Optional as much as possible, to get rid of NullPointerExceptions. What do you think, @beinan?

Collaborator Author

@shangxinli Feb 9, 2022

I am fine with replacing it with Optional, but when I tried it, we still end up with null checking. This is because ColumnIndexStore.java, defined in the Parquet repo, uses null checking, so the class ParquetColumnIndexStore that extends ColumnIndexStore needs to have the same signature. If we replace it with Optional, we end up converting from null checking to Optional.empty() checking in HDFSParquetDataSource, and then back from Optional.empty() to null checking in ParquetColumnIndexStore. Since we only use the empty() API of Optional, as that article mentioned, null checking and empty() are not much different in this case. Thoughts?
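The boundary problem described above can be sketched in a few lines. This is an illustrative example, not the actual Presto/Parquet signatures: the interface and method names are stand-ins, and the point is only that wrapping a nullable return in Optional on one side still forces an unwrap back to null at the interface boundary.

```java
import java.util.Optional;

// Sketch of the Optional-vs-null boundary issue discussed above.
// IndexStore is a hypothetical stand-in for the Parquet-side interface.
public class OptionalBoundary
{
    // The Parquet-side interface is defined with a nullable return value.
    interface IndexStore
    {
        String getIndex(String columnPath); // may return null
    }

    // Presto-side code can wrap the nullable result in an Optional...
    static Optional<String> lookupIndex(IndexStore store, String columnPath)
    {
        return Optional.ofNullable(store.getIndex(columnPath));
    }

    // ...but to satisfy the Parquet interface it must be unwrapped back to a
    // nullable value, so the null check is only moved, not removed.
    static String toNullable(Optional<String> index)
    {
        return index.orElse(null);
    }
}
```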

Collaborator

Got it. Yep, too bad Presto and Parquet have different coding styles.
How about we use Optional as much as possible in Presto code? This reminds me of the old days when trying to implement the new Parquet reader for Presto :)

Member

I have the same feeling; I also suggested using Optional in the old PRs.

TupleDomain<DeltaColumnHandle> effectivePredicate,
FileFormatDataSourceStats stats)
FileFormatDataSourceStats stats,
boolean readColumnIndexFilter)
Collaborator

s/readColumnIndexFilter/columnIndexFilterEnabled/g

Collaborator Author

sure

}

boolean hasColumnIndex = false;
for (ColumnChunkMetaData column : blockMetadata.getColumns()) {
Collaborator

could we rewrite this with stream api? use anyMatch?

Collaborator Author

Great idea!
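The suggested rewrite is straightforward to sketch. ColumnChunk below is a hypothetical, minimal stand-in for ColumnChunkMetaData exposing only the one check the loop needs; both forms short-circuit on the first match.

```java
import java.util.List;

// Sketch of rewriting the hasColumnIndex loop with Stream.anyMatch,
// as suggested in the review. ColumnChunk is an illustrative stand-in.
public class ColumnIndexCheck
{
    interface ColumnChunk
    {
        boolean hasColumnIndex();
    }

    // Loop form, as in the original patch:
    static boolean hasColumnIndexLoop(List<ColumnChunk> columns)
    {
        boolean hasColumnIndex = false;
        for (ColumnChunk column : columns) {
            if (column.hasColumnIndex()) {
                hasColumnIndex = true;
                break;
            }
        }
        return hasColumnIndex;
    }

    // Stream form suggested in the review; anyMatch also short-circuits:
    static boolean hasColumnIndexStream(List<ColumnChunk> columns)
    {
        return columns.stream().anyMatch(ColumnChunk::hasColumnIndex);
    }
}
```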

return this.materializedViewMissingPartitionsThreshold;
}

@Config("hive.parquet-use-column-index-filter")
Collaborator

how about:
hive.parquet-column-index-filter-enabled?

Collaborator Author

yeah
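For context, a Presto config property like this is a getter/setter pair on a config class. The sketch below omits the airlift @Config annotation so it compiles standalone; in Presto the setter would carry @Config("hive.parquet-column-index-filter-enabled"), and the field name here is an assumption.

```java
// Sketch of the renamed config property discussed above. In Presto the
// setter would be annotated with
// @Config("hive.parquet-column-index-filter-enabled"); the annotation is
// omitted here so the class compiles without the airlift dependency.
public class HiveConfigSketch
{
    private boolean parquetColumnIndexFilterEnabled;

    public boolean isParquetColumnIndexFilterEnabled()
    {
        return parquetColumnIndexFilterEnabled;
    }

    public HiveConfigSketch setParquetColumnIndexFilterEnabled(boolean enabled)
    {
        this.parquetColumnIndexFilterEnabled = enabled;
        return this; // fluent setters are the convention in Presto config classes
    }
}
```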

valueCount += readDataPageV1(pageHeader, uncompressedPageSize, compressedPageSize, pages);
firstRowIndex = PageReader.getFirstRowIndex(dataPageCount, offsetIndex);
valueCount += readDataPageV1(pageHeader, uncompressedPageSize, compressedPageSize, firstRowIndex, pages);
++dataPageCount;
Collaborator

s/++dataPageCount/dataPageCount = dataPageCount + 1/g

Collaborator Author

sure

return dataHeaderV2.getNum_values();
}

private boolean hasMorePages(long valuesCountReadSoFar, int dataPageCountReadSoFar)
Collaborator

s/valuesCountReadSoFar/valuesCount/g
s/dataPageCountReadSoFar/pagesCount/g

Collaborator Author

sure
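The renamed method's role can be sketched as follows. This is an assumption-laden illustration, not the actual PageReader code: the idea is that with an offset index the exact data-page count is known up front, so the reader can stop on page count rather than value count.

```java
// Sketch of the hasMorePages logic under discussion (illustrative, not the
// actual Presto PageReader). The totals come from chunk metadata; when an
// offset index is available the exact page count is known, otherwise we
// fall back to counting values.
public class PageReaderSketch
{
    private final long totalValueCount;
    private final int totalPageCount; // -1 when no offset index is available

    PageReaderSketch(long totalValueCount, int totalPageCount)
    {
        this.totalValueCount = totalValueCount;
        this.totalPageCount = totalPageCount;
    }

    boolean hasMorePages(long valuesCount, int pagesCount)
    {
        return totalPageCount < 0
                ? valuesCount < totalValueCount
                : pagesCount < totalPageCount;
    }
}
```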

currentBlock = currentBlock + 1;

if (filter != null && readColumnIndexFilter) {
ColumnIndexStore ciStore = blockIndexStores.get(currentBlock);
Collaborator

s/ciStore/columnIndexStore/g

Collaborator Author

sure

OffsetIndex offsetIndex = blockIndexStores.get(currentBlock).getOffsetIndex(metadata.getPath());
OffsetIndex filteredOffsetIndex = ColumnIndexFilterUtils.filterOffsetIndex(offsetIndex, currentGroupRowRanges, blocks.get(currentBlock).getRowCount());
List<OffsetRange> offsetRanges = ColumnIndexFilterUtils.calculateOffsetRanges(filteredOffsetIndex, metadata, offsetIndex.getOffset(0), startingPosition);
List<ConsecutivePartList> allParts = concatRanges(offsetRanges);
Collaborator

s/allParts/offsetRanges/g

Collaborator Author

We already have the variable offsetRanges. We concat it.

Collaborator

Got it. How about consecutiveRanges?
Generally, we do not use variable abbreviations in Presto.

Collaborator Author

Changed it.

* Describes a list of consecutive parts to be read at once. A consecutive part may contain whole column chunks or
* only parts of them (some pages).
*/
private class ConsecutivePartList
Collaborator

s/ConsecutivePartList/PageRanges/g

Collaborator Author

This name is from Parquet. Better to keep the same name? What do you think?

Collaborator

Got it. We could either add a comment above describing that the private class is from Parquet, or rename it to PageRanges.

Collaborator Author

I just replaced it with PageRanges.

@shangxinli
Collaborator Author

nice work, @shangxinli left some comments recently did a parquet code refactor, to speedup ParquetTupleDomainPredicate building, could you please rebase on the recent master?

Yes, did it. Thanks for letting me know.

@shangxinli force-pushed the column_indexes_dev_new branch from 25f3e5b to 6760bb1 on January 31, 2022 05:38
@shangxinli force-pushed the column_indexes_dev_new branch from 563a586 to 6bccc1e on January 31, 2022 14:41
@shangxinli
Collaborator Author

Except for the known build issue, @beinan @zhenxiao Do you still have other comments?

This is a pretty big change and it often has conflicts when new commits are in. The conflicts already happened 3 times and it is a painful process to manually resolve those conflicts.

Collaborator

@zhenxiao left a comment

thank you, @shangxinli
mostly looks nice. some minor things

* values (and the related rl and dl) for the rows [20, 39] in the end of the page 0 for col2. Similarly, we have to
* skip values while reading page0 and page1 for col3.
*/
private void processValuesSync(int valuesToRead, Consumer<Void> valueConsumer)
Collaborator

I am still inclined to merge the two functions; the signatures are the same. The new function would look like:
private void processValues(int valuesToRead, Consumer<Void> valueConsumer, boolean indexEnabled)
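The merge suggested above can be sketched as a single method behind an indexEnabled flag. Everything here is illustrative: the Consumer<Void> from the review is swapped for IntConsumer so the sketch has something to consume, and the keepRow mask stands in for the row ranges computed from the column index.

```java
import java.util.function.IntConsumer;

// Sketch of merging the plain and index-filtered value-processing paths
// into one method, as suggested. keepRow is a hypothetical stand-in for
// the row ranges derived from the column index.
public class ValueProcessor
{
    static int processValues(int valuesToRead, IntConsumer valueConsumer, boolean indexEnabled, boolean[] keepRow)
    {
        int consumed = 0;
        for (int i = 0; i < valuesToRead; i++) {
            if (indexEnabled && !keepRow[i]) {
                continue; // skip values filtered out by the column index
            }
            valueConsumer.accept(i);
            consumed++;
        }
        return consumed;
    }
}
```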

}
}

// Used for columns are not in this parquet file
Collaborator

we could remove this comment. Or,
for columns not in this parquet file

Collaborator Author

We can remove the comments


@shangxinli force-pushed the column_indexes_dev_new branch from 94c2ffa to e94aada on February 10, 2022 01:21
Collaborator

@zhenxiao left a comment

Looks nice, @shangxinli. One minor issue: could you please squash all commits into one and add a release note?
Also, make sure all tests pass.

Collaborator

let's add a message to the exception, like:
parquet type not supported: %s
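The suggestion above amounts to formatting the offending value into the exception message. A minimal sketch, where the exception type and the helper name are assumptions:

```java
// Sketch of the suggested exception message; the exception type and the
// helper method are illustrative, not the actual Presto code.
public class TypeErrors
{
    static RuntimeException unsupportedParquetType(Object type)
    {
        return new IllegalArgumentException(String.format("parquet type not supported: %s", type));
    }
}
```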

Member

@beinan left a comment

LGTM, I think I reviewed most of the code changes in the previous PRs. Thank you @shangxinli! Looking forward to seeing more contributions from the Parquet community! Thanks!

@beinan
Member

beinan commented Feb 10, 2022

rerun the tests

@shangxinli
Collaborator Author

Thanks @beinan @zhenxiao for your time to review!

@rongrong
Contributor

We don't do merge commits, please rebase your changes onto master and squash all changes into one commit. Thanks!

@shangxinli force-pushed the column_indexes_dev_new branch from ece19ac to 47bfd42 on February 10, 2022 21:41
@shangxinli
Collaborator Author

@rongrong Does that mean I need to create a new PR?

@shangxinli
Collaborator Author

Created a new PR #17284 to rebase due to so many conflicts.
