Skip to content

Improve logic for reading dictionaries during row group pruning#14247

Merged
raunaqmorarka merged 5 commits intotrinodb:masterfrom
raunaqmorarka:pqr-dictionary
Sep 24, 2022
Merged

Improve logic for reading dictionaries during row group pruning#14247
raunaqmorarka merged 5 commits intotrinodb:masterfrom
raunaqmorarka:pqr-dictionary

Conversation

@raunaqmorarka
Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka commented Sep 22, 2022

Description

Improve logic for reading dictionaries during row group pruning in parquet reader

Non-technical explanation

Improves parquet reader performance in the presence of predicates.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Delta, Iceberg
* Improve performance of reading parquet files for queries with predicates. ({issue}`14247`)

When the nulls count in column statistics for a column
is 0, we can use that for setting nullAllowed to false
in parquet dictionary domains
Copy link
Copy Markdown
Member

@skrzypo987 skrzypo987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skimmed, looks good

Perform column index and dictionary lookups only for the subset of
columns where it can be useful. This prevents unnecessary
filesystem reads and decoding work when the predicate on a column
comes from a connector's file-level min/max stats or more generally
when the predicate selects a domain equal to or wider than row-group min/max.
Limits the size of the filesystem read to only the required
length when fetching dictionary for row-group pruning in parquet
Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should introduce feature flag for optimized reader first?

@@ -21,11 +21,13 @@
public class DictionaryDescriptor
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the nulls count in column statistics for a column is 0

Do you think it's possible for some writer to set it incorrectly?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If column statistics are incorrect then our existing predicate pushdown logic could also give wrong results, so we should be able to rely on it

@raunaqmorarka
Copy link
Copy Markdown
Member Author

Do you think we should introduce feature flag for optimized reader first?

These are targeted fixes/improvements in existing code and not a rewrite of the reader, so these shouldn't need to be behind a flag

@raunaqmorarka raunaqmorarka merged commit 70bd5d7 into trinodb:master Sep 24, 2022
@raunaqmorarka raunaqmorarka deleted the pqr-dictionary branch September 24, 2022 10:12
@github-actions github-actions bot added this to the 398 milestone Sep 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants