-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Parquet reader - avoid redundant dictionary reads #27407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Parquet reader - avoid redundant dictionary reads #27407
Conversation
918ffd5 to
0c1761c
Compare
|
Started benchmark workflow for this PR with test type =
|
|
Started benchmark workflow for this PR with test type =
|
| private static final PredicateMatchResult FALSE_WITH_DICT_MATCHING_NOT_NEEDED = new PredicateMatchResult(false, null, null); | ||
| private static final PredicateMatchResult TRUE_WITH_DICT_MATCHING_NOT_NEEDED = new PredicateMatchResult(true, null, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we avoid null here ? Pass TupleDomain#all and an empty set ?
| import java.util.Set; | ||
|
|
||
| public record RowGroupInfo(PrunedBlockMetadata prunedBlockMetadata, long fileRowOffset, Optional<ColumnIndexStore> columnIndexStore) {} | ||
| public record RowGroupInfo(PrunedBlockMetadata prunedBlockMetadata, long fileRowOffset, Optional<ColumnIndexStore> columnIndexStore, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: break each field onto a separate line
| } | ||
| } | ||
| catch (Exception e) { | ||
| log.error(e, "Error while matching dictionary predicate for field " + field); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to swallow an exception at this point? Seems like we should re-throw a variation of RuntimeException from dictionaryPredicateMatch (instead of throw IOException) and then not swallow that exception here since if we can't decode a dictionary page we should probably fail.
Description
Avoid redundant dictionary reads - earlier the dictionary page is read at the time of rowgroup pruning (
PredicateUtils) and then again read during the actual file reading(ParquetReader). This PR removes this redundant work and moves dictionary based rowgroup pruning logic toParquetReaderWould like to thank @pettyjamesm for his initial suggestions on this.
Some numbers with patch
These queries were run on 1TB TPC-DS data(parquet, zstd compressed).
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: