Skip to content

Skip reading Parquet pages using Column Indexes feature.#17284

Merged
beinan merged 1 commit intoprestodb:masterfrom
shangxinli:column_index_final_one_commit
Feb 21, 2022
Merged

Skip reading Parquet pages using Column Indexes feature.#17284
beinan merged 1 commit intoprestodb:masterfrom
shangxinli:column_index_final_one_commit

Conversation

@shangxinli
Copy link
Copy Markdown
Collaborator

@shangxinli shangxinli commented Feb 11, 2022

Port some code from parquet-mr repo https://github.com/apache/parquet-mr
Co-authored-by: Gabor Szadovszky gabor.szadovszky@cloudera.com

More details about Parquet Column Indexes feature:

Column Indexes also named as page level indexes that have min/max values for each page in a given column chunk. When reading pages, a reader doesn't need to process the page header to determine whether the page could be skipped based on the statistics. More information about this feature can be found https://github.com/apache/parquet-format/blob/master/PageIndex.md

Test plan - (Please fill in how you tested your changes)

This feature was tested in the Uber staging environment and then rolled out to production for 5+ months.

Fill in the release notes towards the bottom of the PR description.
See Release Notes Guidelines for details.

== RELEASE NOTES ==

General Changes
* Add support to use Parquet page-level statistics to skip pages  

Hive Changes
* Introduce a flag "hive.parquet-column-index-filter-enabled" to turn on/off this feature. By default it is off.

@shangxinli
Copy link
Copy Markdown
Collaborator Author

shangxinli commented Feb 11, 2022

@rongrong @zhenxiao @beinan I just created this new PR for the Parquet column index to rebase. If you can merge it, it would be great! The old PRs can be found #17216 and #16963

Copy link
Copy Markdown
Member

@beinan beinan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @shangxinli for you great work and rebase! Just two minor var naming issue, I guess you might miss these two places during the rebase

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setColumnIndexFilterEnabled

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I change to 'setColumnIndexFilterEnabled' it always causes some odd test failures that seem unrelated. I changed it to setColumnIndexFilter

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems even we change to setColumnIndexFilterEnabled still doesn't work ether. Just keep the original name for now.

Copy link
Copy Markdown
Collaborator

@zhenxiao zhenxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shangxinli nice work
A few more comments
we will speedup to merge this PR

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/canDropCanWithRangeStats/canDropWithRangeStatistics/g

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return columnDomain.intersect(domain).isNone();

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ciConversions/conversions/g

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currentRow = currentRow + 1;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we move pageIndex = pageIndex + 1 out of the function call?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment is not useful. let's remove it

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we consider merge PageRanges with OffsetRange into one class? They are quite similar

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the same but we have to convert between int/long.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, shall we use merge the two classes?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable 'length' is int in PageRanges but long in OffsetRanges. If we merge, we need to cast, which we should generally avoid casting as much as we can.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am inclined to merge PageRanges into OffsetRange. We are casting OffsetRange length to int in code below. Doing the merge could remove duplicate code, and save the cast, too.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/rangeStartPos/startPosition/g

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Copy Markdown
Collaborator

@zhenxiao zhenxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shangxinli mostly good
a few remaining minor things

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shangxinli could we fix here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, shall we use merge the two classes?

@shangxinli shangxinli force-pushed the column_index_final_one_commit branch 4 times, most recently from d3e1180 to f8d63e2 Compare February 17, 2022 06:16
@shangxinli
Copy link
Copy Markdown
Collaborator Author

@zhenxiao @beinan @rongrong Just addressed the comments. Please have a look if you have more.

Copy link
Copy Markdown
Collaborator

@zhenxiao zhenxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @shangxinli mostly look
only 2 remaining minor issues. others looks good to me

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are casting OffsetRange length to int, shall we use OffsetRange directly?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am inclined to merge PageRanges into OffsetRange. We are casting OffsetRange length to int in code below. Doing the merge could remove duplicate code, and save the cast, too.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kindly ping

@shangxinli
Copy link
Copy Markdown
Collaborator Author

Addressed the last two comments from @zhenxiao.

@shangxinli shangxinli force-pushed the column_index_final_one_commit branch from b5164a7 to cde2032 Compare February 18, 2022 23:13
@zhenxiao
Copy link
Copy Markdown
Collaborator

looks good to me
hi @shangxinli could you please keep the last 2 comments addressed, and squash all commits into one?
Then @beinan and I could merge it

@shangxinli
Copy link
Copy Markdown
Collaborator Author

Yeah, I did. But the test keep failing with the error "didn't finish within the time-out 60000". I revert it but still failed. It shouldn't be related to the change itself. I will add back the last commit and squash.

@shangxinli shangxinli force-pushed the column_index_final_one_commit branch from ce3b0e4 to 0a98e6d Compare February 19, 2022 16:34
Port some code from parquet-mr repo https://github.com/apache/parquet-mr
Co-authored-by: Gabor Szadovszky <gabor.szadovszky@cloudera.com>

More details about Parquet Column Indexes feature:

Column Indexes also named as page level indexes that have min/max values for each page in a given column chunk. When reading pages, a reader doesn't need to process the page header to determine whether the page could be skipped based on the statistics. More information about this feature can be found https://github.com/apache/parquet-format/blob/master/PageIndex.md
@shangxinli shangxinli force-pushed the column_index_final_one_commit branch from 0a98e6d to 58db92d Compare February 20, 2022 17:48
@beinan beinan merged commit 3cb1ee2 into prestodb:master Feb 21, 2022
@aweisberg
Copy link
Copy Markdown
Contributor

aweisberg commented Feb 22, 2022

Does this break the build?

[ERROR] Failed to execute goal on project presto-parquet: Could not resolve dependencies for project com.facebook.presto:presto-parquet:jar:0.272-SNAPSHOT: Failed to collect dependencies at junit:junit:jar:4.13.1: Failed to read artifact descriptor for junit:junit:jar:4.13.1: Could not transfer artifact junit:junit:pom:4.13.1 from/to nexus (https://maven.thefacebook.com/nexus/content/groups/public): Transfer failed for https://maven.thefacebook.com/nexus/content/groups/public/junit/junit/4.13.1/junit-4.13.1.pom: No route to host (Host unreachable) -> [Help 1]

Looks like a real version of junit, maybe something wrong @ Central?

@beinan
Copy link
Copy Markdown
Member

beinan commented Feb 22, 2022

Does this break the build?

[ERROR] Failed to execute goal on project presto-parquet: Could not resolve dependencies for project com.facebook.presto:presto-parquet:jar:0.272-SNAPSHOT: Failed to collect dependencies at junit:junit:jar:4.13.1: Failed to read artifact descriptor for junit:junit:jar:4.13.1: Could not transfer artifact junit:junit:pom:4.13.1 from/to nexus (https://maven.thefacebook.com/nexus/content/groups/public): Transfer failed for https://maven.thefacebook.com/nexus/content/groups/public/junit/junit/4.13.1/junit-4.13.1.pom: No route to host (Host unreachable) -> [Help 1]

Looks like a real version of junit, maybe something wrong @ Central?

hmmm, very likely, let me remove the real junit version. Just posted a PR to fix it #17334 @aweisberg

@varungajjala varungajjala mentioned this pull request Mar 22, 2022
9 tasks
@asjadsyed asjadsyed mentioned this pull request Mar 23, 2022
9 tasks
@asjadsyed asjadsyed mentioned this pull request Apr 1, 2022
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants