Support Parquet TupleDomain using ColumnDescriptor by zhenxiao · Pull Request #6892 · prestodb/presto

zhenxiao · 2016-12-15T02:10:12Z

Currently the Parquet TupleDomain is constructed based on HiveColumnHandle. This does not work when nested predicates are pushed down, e.g.

select s.a
from t
where s.b > 10

This patch constructs the Parquet TupleDomain from Parquet's ColumnDescriptor, so that it can work with nested predicate pushdown.
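To illustrate the idea described above (this is not the code in the patch), here is a minimal sketch in plain Java. It assumes a predicate domain can be keyed by the full Parquet column path, so that a nested field such as `s.b` gets its own entry. The names `PathKeyedDomains`, `GreaterThan`, `domainsByPath` and `canSkipRowGroup` are hypothetical stand-ins; the real implementation works with Presto's TupleDomain and Parquet's ColumnDescriptor.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PathKeyedDomains {
    // Hypothetical stand-in for a predicate domain: "value > bound".
    static final class GreaterThan {
        final long bound;

        GreaterThan(long bound) {
            this.bound = bound;
        }

        // The predicate may match some row if the column maximum exceeds the bound.
        boolean mayMatch(long min, long max) {
            return max > bound;
        }
    }

    // Domains keyed by the full column path, e.g. ["s", "b"] for the nested field s.b.
    static Map<List<String>, GreaterThan> domainsByPath() {
        Map<List<String>, GreaterThan> domains = new HashMap<>();
        domains.put(Arrays.asList("s", "b"), new GreaterThan(10));
        return domains;
    }

    // A row group can be skipped when the statistics of a constrained column rule out every row.
    static boolean canSkipRowGroup(Map<List<String>, GreaterThan> domains, List<String> path, long min, long max) {
        GreaterThan domain = domains.get(path);
        return domain != null && !domain.mayMatch(min, max);
    }
}
```

With a key per top-level column only, `s.a` and `s.b` would share the single entry for `s`; a path key keeps them apart.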

zhenxiao · 2016-12-15T02:13:39Z

@nezihyigitbasi @dain this is a first step toward supporting nested predicate pushdown for Parquet in Presto. This PR bases the Parquet TupleDomain on the Parquet ColumnDescriptor, so that it works easily once nested predicates are pushed down.

zhenxiao · 2016-12-29T05:01:29Z

@dain @nezihyigitbasi kindly ping

zhenxiao · 2017-01-19T01:01:04Z

@dain @nezihyigitbasi kindly ping

zhenxiao · 2017-01-25T14:06:23Z

@dain @nezihyigitbasi could you please review when you are free?

zhenxiao · 2017-02-08T22:14:37Z

@nezihyigitbasi @dain kindly ping

zhenxiao · 2017-02-28T15:45:43Z

@nezihyigitbasi @dain kindly ping

dain · 2017-03-10T18:43:24Z

In an offline discussion we decided that @zhenxiao was going to investigate using synthetic virtual-columns in the connector to enable this push-down feature.

zhenxiao · 2017-03-10T19:20:25Z

Hi @dain:
This PR makes the Parquet TupleDomain based on the Parquet ColumnDescriptor instead of a column index. It is purely a Parquet change. It prepares for nested predicate pushdown, and it also benefits the existing Parquet code.

Could you please review this first?

dain · 2017-03-10T20:23:32Z

Ah ok. Will do.

zhenxiao · 2017-04-03T23:53:14Z

@dain @nezihyigitbasi kindly ping

nezihyigitbasi · 2017-04-04T18:36:10Z

@zhenxiao I will take a look at this shortly.

nezihyigitbasi

@zhenxiao I did a first pass, back to you.

nezihyigitbasi · 2017-04-04T22:08:33Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetTypeUtils.java

This doesn't handle the decimal type correctly, please see Parquet spec and my recent related fix 37c57c9

Thank you, resolved.

nezihyigitbasi · 2017-04-04T22:10:33Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java

now that the typeManager is not used anymore in this method we can remove it.

Got it, resolved.

nezihyigitbasi · 2017-04-04T22:12:24Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/predicate/ParquetPredicateUtils.java

static import Map.Entry

yep, updated

nezihyigitbasi · 2017-04-04T22:17:44Z

...to-hive/src/main/java/com/facebook/presto/hive/parquet/predicate/ParquetColumnReference.java

This class already has access to the column descriptor so we shouldn't need to pass the Presto type as well since Presto type can be derived from that (like you do in getPrestoType()). So I guess we can move getPrestoType() logic to this class.

Got it. This class is not needed at all. I will delete it and use RichColumnDescriptor directly.

nezihyigitbasi · 2017-04-04T22:23:10Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/predicate/ParquetPredicate.java

@param statistics column statistics

nezihyigitbasi · 2017-04-04T22:28:48Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/predicate/ParquetPredicateUtils.java

I don't think you need the paths variable, you can just pass Arrays.asList(columnMetaData.getPath().toArray()) below.

Got it. Updated.

nezihyigitbasi · 2017-04-04T22:30:56Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/predicate/ParquetPredicateUtils.java

Although not entirely related to this patch, why are we ignoring the IOException here? If we get an exception we will silently skip populating dictionaries for that particular column.

If we hit any problem while reading the dictionary page, we silently skip it and do not use that dictionary to try to skip row groups.

OK as long as we are not messing up while we read.
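The behaviour discussed in this thread can be sketched as follows. This is only an illustration under stated assumptions, not the code in ParquetPredicateUtils: `DictionarySkipSketch`, `DictionaryReader` and `readDictionaries` are hypothetical names, and dictionaries are modelled as plain strings.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionarySkipSketch {
    // Hypothetical reader: returns the dictionary for a column, or fails with IOException.
    interface DictionaryReader {
        String read(String column) throws IOException;
    }

    // Collects dictionaries for the given columns. A column whose dictionary
    // cannot be read is simply left out of the result.
    static Map<String, String> readDictionaries(List<String> columns, DictionaryReader reader) {
        Map<String, String> dictionaries = new HashMap<>();
        for (String column : columns) {
            try {
                dictionaries.put(column, reader.read(column));
            }
            catch (IOException ignored) {
                // No dictionary is recorded for this column, so dictionary-based
                // row group skipping is not attempted for it.
            }
        }
        return dictionaries;
    }
}
```

Swallowing the exception only costs an optimization in this sketch: a column without an entry cannot be used to skip a row group, so the data is read as usual.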

nezihyigitbasi · 2017-04-04T22:32:36Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/predicate/ParquetPredicateUtils.java

What will happen if the descriptor is not present and we return an empty map of dictionaries?

We only build dictionaries when the descriptor exists. Otherwise the dictionaries map is empty, the file is scanned, and the dictionary predicate does not apply.

nezihyigitbasi · 2017-04-04T22:37:52Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/predicate/ParquetPredicateUtils.java

getDomains() returns an Optional, are we sure it's non-empty? If yes, we should assert that (com.google.common.base.Verify::verify).
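The reviewer names Guava's `Verify.verify` for this check. As a dependency-free illustration of the same intent, the sketch below uses `Optional.orElseThrow` from the Java standard library instead. `RequireDomains` and `requireDomains` are hypothetical names, and the message text is the one suggested later in this review.

```java
import java.util.Map;
import java.util.Optional;

public class RequireDomains {
    // Unwraps the optional domain map and fails loudly if it is absent,
    // instead of calling get() and hoping it is present.
    static <K, V> Map<K, V> requireDomains(Optional<Map<K, V>> domains) {
        return domains.orElseThrow(() -> new IllegalStateException("parquetTupleDomain is empty"));
    }
}
```

The difference from a bare `get()` is that the failure carries a message that says which assumption was violated.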

nezihyigitbasi · 2017-04-05T00:19:19Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetTypeUtils.java

throw new PrestoException(NOT_SUPPORTED, "Unsupported parquet type: " + descriptor.getType());

yep, updated

zhenxiao · 2017-04-05T03:51:21Z

Thank you @nezihyigitbasi
Comments addressed.

nezihyigitbasi

LGTM except some minor comments. BTW do we need similar changes in the new reader for nested predicate pushdown support?

nezihyigitbasi · 2017-04-10T20:04:00Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetTypeUtils.java

now that we have this method in ParquetTypeUtils we can update ParquetColumnReader to also call this one instead of maintaining two copies.

Yep, updated.

nezihyigitbasi · 2017-04-10T20:07:02Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/predicate/ParquetPredicateUtils.java

"parquetTupleDomain is empty"

zhenxiao · 2017-04-11T02:54:20Z

Thank you @nezihyigitbasi, comments addressed.
Yes, we need something like virtual columns to push down nested columns. The support on the new Parquet reader side is already there.

nezihyigitbasi · 2017-04-11T18:22:43Z

@dain I think this looks good.

dain

Just one minor comment. Otherwise, Nezih, merge whenever you want.

dain · 2017-04-13T00:36:40Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetTypeUtils.java

Can we rename i and j to something more descriptive? Maybe columnIndex and level (and rename level to maxLevel)?

zhenxiao · 2017-04-13T00:58:02Z

Thank you @dain @nezihyigitbasi
Comments addressed.

nezihyigitbasi · 2017-04-13T01:06:23Z

thanks @zhenxiao I will merge this once the tests all pass.

shurvitz · 2019-10-23T15:05:58Z

Hi @zhenxiao ,
I have a question regarding this change to getDictionaries(). It looks like the change folded a loop inside a loop into a single loop. What I'm noticing in Presto217 is a decrease in bytes-scanned during the predicate push-down process. This is the end result of that loop hard-exiting via break after the first predicate descriptor is matched and a dictionary page is added. Before the change, the loop would iterate over all descriptors... Was leaving the break statement intentional?
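The effect described in this question can be sketched in isolation. The code below is an assumption-laden illustration, not the actual getDictionaries() implementation: `DictionaryLoopSketch` and `collect` are hypothetical names, columns are plain strings, and the `breakAfterFirst` flag exists only to compare the two behaviours side by side.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DictionaryLoopSketch {
    // Walks the columns of a row group and records a dictionary for each column
    // that appears in the predicate. With breakAfterFirst = true the loop exits
    // as soon as the first matching column has been recorded.
    static Map<String, String> collect(List<String> columnsInRowGroup, Set<String> predicateColumns, boolean breakAfterFirst) {
        Map<String, String> dictionaries = new LinkedHashMap<>();
        for (String column : columnsInRowGroup) {
            if (predicateColumns.contains(column)) {
                dictionaries.put(column, "dictionary page of " + column);
                if (breakAfterFirst) {
                    break;
                }
            }
        }
        return dictionaries;
    }
}
```

In this sketch, a predicate on two columns yields one dictionary with the early exit and two without it, so any later predicate column never gets a dictionary-based check when the loop breaks.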

zhenxiao · 2019-10-23T19:11:23Z

Hi @shurvitz thanks for reaching out.
I did not quite get your question. This PR is actually a refactoring that leverages the Parquet ColumnDescriptor, instead of an integer index, for dictionary pushdown and predicate pushdown.
Could you please elaborate, with file and line number?

zhenxiao · 2019-10-24T00:28:50Z

Got it. Thank you, @shurvitz @pettyjamesm

zhenxiao requested review from dain and nezihyigitbasi December 15, 2016 02:10

facebook-github-bot added the CLA Signed label Dec 15, 2016

zhenxiao force-pushed the parquet-tuple-domain branch from 0dba438 to eef5773 Compare December 15, 2016 06:16

zhenxiao force-pushed the parquet-tuple-domain branch from eef5773 to fb3cf7a Compare January 7, 2017 06:50

zhenxiao changed the title ~~Support TupleDomain for Parquet~~ Support Parquet TupleDomain using ColumnDescriptor Jan 7, 2017

zhenxiao force-pushed the parquet-tuple-domain branch 2 times, most recently from 2dfa193 to 470f87a Compare January 10, 2017 01:39

zhenxiao mentioned this pull request Jan 11, 2017

Nested predicate push down to Parquet Reader #7045

Closed

zhenxiao force-pushed the parquet-tuple-domain branch 2 times, most recently from b3aec2e to b8e275f Compare January 11, 2017 21:57

zhenxiao force-pushed the parquet-tuple-domain branch from b8e275f to 71e17e6 Compare January 25, 2017 13:57

dain assigned zhenxiao Mar 10, 2017

dain removed their request for review March 10, 2017 18:43

dain assigned dain and unassigned zhenxiao Mar 10, 2017

dain self-requested a review March 10, 2017 20:23

dain assigned nezihyigitbasi and unassigned dain Mar 17, 2017

dain removed their request for review March 17, 2017 19:10

nezihyigitbasi requested changes Apr 5, 2017

zhenxiao force-pushed the parquet-tuple-domain branch from 71e17e6 to bbed115 Compare April 5, 2017 03:50

nezihyigitbasi reviewed Apr 10, 2017

zhenxiao force-pushed the parquet-tuple-domain branch from bbed115 to 5eda135 Compare April 11, 2017 02:52

nezihyigitbasi approved these changes Apr 11, 2017

nezihyigitbasi assigned dain and unassigned nezihyigitbasi Apr 12, 2017

dain approved these changes Apr 13, 2017

dain assigned nezihyigitbasi and unassigned dain Apr 13, 2017

zhenxiao force-pushed the parquet-tuple-domain branch from 5eda135 to ff5cd08 Compare April 13, 2017 00:57

Support Parquet TupleDomain using ColumnDescriptor

ff5cd08

nezihyigitbasi merged commit 0f7982b into prestodb:master Apr 13, 2017

zhenxiao deleted the parquet-tuple-domain branch April 15, 2017 00:02

shubhamtagra mentioned this pull request Oct 3, 2017

Query failures due to disparity in Presto type deduction logic from Hive schema and from Parquet column descriptor #9084

Closed

ryanrupp mentioned this pull request Mar 16, 2018

Add support for DATE predicate pushdown with Parquet via min/max and … #10181

Closed

pettyjamesm mentioned this pull request Oct 23, 2019

Parquet Dictionary Predicate Pushdown Fixes trinodb/trino#1846

Merged

pettyjamesm mentioned this pull request Oct 23, 2019

Parquet Dictionary Predicate Pushdown Fixes #13594

Merged
