Improve logic for reading dictionaries during row group pruning by raunaqmorarka · Pull Request #14247 · trinodb/trino

raunaqmorarka · 2022-09-22T06:15:51Z

Description

Improve logic for reading dictionaries during row group pruning in the Parquet reader

Non-technical explanation

Improves Parquet reader performance for queries with predicates.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Delta, Iceberg
* Improve performance of reading Parquet files for queries with predicates. ({issue}`14247`)

When the null count in a column's statistics is 0, we can use that to set nullAllowed to false in Parquet dictionary domains.
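To illustrate the idea, here is a minimal, self-contained sketch. It is not the code from this PR: `NullAllowedSketch`, `DictionaryDomain`, `fromDictionary` and `canPruneForIsNull` are hypothetical names, and string values stand in for whatever type the column holds. The assumption is that only a recorded null count of exactly 0 proves that the row group has no nulls; a missing count is treated as "nulls possible".

```java
import java.util.OptionalLong;
import java.util.Set;

class NullAllowedSketch
{
    // Hypothetical stand-in for a predicate domain: the distinct values plus a null flag
    static final class DictionaryDomain
    {
        final Set<String> values;
        final boolean nullAllowed;

        DictionaryDomain(Set<String> values, boolean nullAllowed)
        {
            this.values = Set.copyOf(values);
            this.nullAllowed = nullAllowed;
        }
    }

    // Build a domain from dictionary values and the (optional) null count from column statistics
    static DictionaryDomain fromDictionary(Set<String> dictionaryValues, OptionalLong nullCount)
    {
        // Only a recorded null count of exactly 0 proves the absence of nulls
        boolean nullAllowed = !nullCount.isPresent() || nullCount.getAsLong() != 0;
        return new DictionaryDomain(dictionaryValues, nullAllowed);
    }

    // "col IS NULL" cannot match any row when the domain excludes null
    static boolean canPruneForIsNull(DictionaryDomain domain)
    {
        return !domain.nullAllowed;
    }

    public static void main(String[] args)
    {
        DictionaryDomain noNulls = fromDictionary(Set.of("a", "b"), OptionalLong.of(0));
        DictionaryDomain unknown = fromDictionary(Set.of("a", "b"), OptionalLong.empty());
        System.out.println(canPruneForIsNull(noNulls));
        System.out.println(canPruneForIsNull(unknown));
    }
}
```

With the flag set to false, a predicate such as `col IS NULL` can rule out the row group from the dictionary domain alone, which is not possible if nulls are always assumed to be present.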

skrzypo987

Skimmed, looks good

lib/trino-parquet/src/main/java/io/trino/parquet/predicate/Predicate.java

lib/trino-parquet/src/main/java/io/trino/parquet/predicate/TupleDomainParquetPredicate.java

Perform column index and dictionary lookups only for the subset of columns where they can be useful. This avoids unnecessary filesystem reads and decoding work when the predicate on a column comes from a connector's file-level min/max stats or, more generally, when the predicate selects a domain equal to or wider than the row-group min/max.
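The check described above can be sketched roughly as follows. This is a simplified illustration, not the logic in `TupleDomainParquetPredicate`: it assumes a single closed range of `long` values per column and ignores nulls, and `LookupCandidateSketch`, `Range` and `isLookupUseful` are hypothetical names.

```java
class LookupCandidateSketch
{
    // Hypothetical closed range [low, high] over long values
    static final class Range
    {
        final long low;
        final long high;

        Range(long low, long high)
        {
            if (low > high) {
                throw new IllegalArgumentException("low must not exceed high");
            }
            this.low = low;
            this.high = high;
        }

        // True when every value of 'other' also lies inside this range
        boolean contains(Range other)
        {
            return low <= other.low && other.high <= high;
        }
    }

    // A dictionary or column index lookup can only prune something when the
    // predicate is narrower than the row-group min/max on at least one side
    static boolean isLookupUseful(Range predicate, Range rowGroupMinMax)
    {
        return !predicate.contains(rowGroupMinMax);
    }

    public static void main(String[] args)
    {
        Range rowGroup = new Range(10, 20);
        System.out.println(isLookupUseful(new Range(0, 100), rowGroup));
        System.out.println(isLookupUseful(new Range(12, 15), rowGroup));
    }
}
```

When the predicate range covers the whole row-group min/max, every non-null value in the row group already satisfies it, so reading the dictionary or column index for that column cannot eliminate the row group.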

Limit the filesystem read to only the required length when fetching the dictionary for row-group pruning in Parquet.

sopel39

Do you think we should introduce feature flag for optimized reader first?

sopel39 · 2022-09-23T14:44:27Z

lib/trino-parquet/src/main/java/io/trino/parquet/predicate/DictionaryDescriptor.java

@@ -21,11 +21,13 @@
 public class DictionaryDescriptor


When the nulls count in column statistics for a column is 0

Do you think it's possible for some writer to set it incorrectly?

If column statistics are incorrect then our existing predicate pushdown logic could also give wrong results, so we should be able to rely on it

raunaqmorarka · 2022-09-23T15:11:24Z

Do you think we should introduce feature flag for optimized reader first?

These are targeted fixes/improvements in existing code and not a rewrite of the reader, so these shouldn't need to be behind a flag

Populate nullAllowed for parquet dictionary domains

c2cd4e4

When the null count in a column's statistics is 0, we can use that to set nullAllowed to false in Parquet dictionary domains.

cla-bot bot added the cla-signed label Sep 22, 2022

raunaqmorarka requested review from martint, skrzypo987 and sopel39 September 22, 2022 06:16

raunaqmorarka added the performance label Sep 22, 2022

github-actions bot added the tests:hive label Sep 22, 2022

skrzypo987 reviewed Sep 22, 2022


lib/trino-parquet/src/main/java/io/trino/parquet/predicate/Predicate.java (outdated, resolved)

lib/trino-parquet/src/main/java/io/trino/parquet/predicate/TupleDomainParquetPredicate.java (outdated, resolved)

raunaqmorarka force-pushed the pqr-dictionary branch from e767bb9 to 80bd124 Compare September 22, 2022 08:24

raunaqmorarka added 4 commits September 22, 2022 13:59

Pre-allocate lists in TupleDomainParquetPredicate to avoid resizing

f9c73f7

Avoid reading full column chunk to get dictionary

ac0c354

Limit the filesystem read to only the required length when fetching the dictionary for row-group pruning in Parquet.
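A rough sketch of how the read length could be derived, assuming the usual Parquet layout in which the dictionary page sits at the start of the column chunk, immediately before the first data page. `DictionaryReadLengthSketch` and `ChunkLayout` are hypothetical names that mirror only the three metadata values needed here; the fallback to the full chunk size is also an assumption for cases where the offsets are missing or inconsistent.

```java
class DictionaryReadLengthSketch
{
    // Hypothetical subset of column chunk metadata; offsets are bytes from the start of the file
    static final class ChunkLayout
    {
        final long dictionaryPageOffset; // 0 when the writer did not record it
        final long firstDataPageOffset;
        final long totalSize;            // size of the whole column chunk in bytes

        ChunkLayout(long dictionaryPageOffset, long firstDataPageOffset, long totalSize)
        {
            this.dictionaryPageOffset = dictionaryPageOffset;
            this.firstDataPageOffset = firstDataPageOffset;
            this.totalSize = totalSize;
        }
    }

    // Number of bytes to read in order to obtain the dictionary page
    static long dictionaryReadLength(ChunkLayout chunk)
    {
        long length = chunk.firstDataPageOffset - chunk.dictionaryPageOffset;
        // Fall back to the whole chunk when the offsets do not describe a
        // dictionary page placed before the first data page
        if (chunk.dictionaryPageOffset <= 0 || length <= 0 || length > chunk.totalSize) {
            return chunk.totalSize;
        }
        return length;
    }

    public static void main(String[] args)
    {
        System.out.println(dictionaryReadLength(new ChunkLayout(100, 400, 10_000)));
        System.out.println(dictionaryReadLength(new ChunkLayout(0, 400, 10_000)));
    }
}
```

For a chunk whose dictionary page is small compared to its data pages, this turns a read of the entire column chunk into a read of just the dictionary bytes.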

Skip processing large dictionaries for row-group pruning

184dc6d

raunaqmorarka force-pushed the pqr-dictionary branch from 80bd124 to 184dc6d Compare September 22, 2022 08:29

sopel39 approved these changes Sep 23, 2022


raunaqmorarka merged commit 70bd5d7 into trinodb:master Sep 24, 2022

raunaqmorarka deleted the pqr-dictionary branch September 24, 2022 10:12

github-actions bot added this to the 398 milestone Sep 24, 2022

raunaqmorarka mentioned this pull request Sep 24, 2022

Release notes for 398 #14245

Closed

colebow mentioned this pull request Sep 28, 2022

Add Trino 398 release notes #14319

Merged

raunaqmorarka merged 5 commits into trinodb:master from raunaqmorarka:pqr-dictionary

3 participants
