Dereference projections and predicate pushdown in Hive #1720
martint merged 11 commits into trinodb:master
martint
left a comment
A couple of initial comments that will affect the overall approach and scope. I'm reviewing the core logic now.
presto-spi/src/main/java/io/prestosql/spi/expression/ConnectorExpressionVisitor.java
presto-hive/src/main/java/io/prestosql/plugin/hive/HiveMetadata.java
1787d4b to fb292f9
wagnermarkd
left a comment
Left a few comments. I don't think the DereferenceAdaption has the desired effect. More details inline.
W.r.t. sequencing, can we handle partial pushdown or unique pushdowns in a separate RB?
presto-hive/src/main/java/io/prestosql/plugin/hive/HiveColumnHandle.java
presto-hive/src/main/java/io/prestosql/plugin/hive/HiveColumnHandle.java
presto-hive/src/main/java/io/prestosql/plugin/hive/HiveMetadata.java
presto-hive/src/main/java/io/prestosql/plugin/hive/HiveMetadata.java
presto-hive/src/main/java/io/prestosql/plugin/hive/HiveMetadata.java
presto-hive/src/main/java/io/prestosql/plugin/hive/HivePageSourceProvider.java
presto-hive/src/main/java/io/prestosql/plugin/hive/HivePageSourceProvider.java
fb292f9 to d6c5c76
@wagnermarkd split the PR into two parts. #1868 deals with the engine side changes.
d6c5c76 to 92a2dd9
d80cbb6 to c84be63
@wagnermarkd @martint this PR is ready for review again.
9725ec7 to 112af55
b65f374 to 6bd7a15
presto-hive/src/main/java/io/prestosql/plugin/hive/parquet/ParquetPageSourceFactory.java
presto-hive/src/main/java/io/prestosql/plugin/hive/parquet/ParquetPageSourceFactory.java
presto-hive/src/main/java/io/prestosql/plugin/hive/rcfile/RcFilePageSourceFactory.java
presto-hive/src/main/java/io/prestosql/plugin/hive/rcfile/RcFilePageSourceFactory.java
presto-hive/src/main/java/io/prestosql/plugin/hive/s3select/S3SelectRecordCursorProvider.java
presto-hive/src/main/java/io/prestosql/plugin/hive/HiveApplyProjectionUtil.java
I think you forgot to address this comment.
presto-hive/src/main/java/io/prestosql/plugin/hive/ReaderProjections.java
/**
 * Returns the assignment key corresponding to the column represented by {@param projectedColumn} in the {@param assignments}, if one exists.
 * The variable in the {@param projectedColumn} can itself be a representation of another projected column. For example,
Does this comment reflect new behavior? If not, move it to the commit that introduced the method.
resolved the mixup: 26b83a5#diff-537e8077e259dbc890e77332fc3fa6ceR105
826f84b to cfbbc51
cfbbc51 to 9769f3a
@martint Thanks for the review; I've addressed your comments. The test failure looks unrelated and doesn't seem to reproduce locally. There are two other main changes from the rebase:
should be the same fields as equals
Add a check for column being a base hive column
resolved the mixup: 8dbd476#diff-9008b1573bb475fa173bd5917b1f249aR183
9769f3a to c8eb95f
c8eb95f to 8b7f573
This method enables creating only the minimally required columns for reading a set of column handles. For example, for Hive columns ["a.b", "a", "c"], this method will create projections using columns "a" and "c", since "a.b" can be projected from "a".
(required by #1953)
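The minimal-column selection described above can be sketched roughly as follows. This is only an illustration, not the PR's code: the class name `SufficientColumnsSketch`, the helper `derivableFrom`, and the use of dotted strings to stand in for column handles are all assumptions made for the example. The actual implementation works on `HiveColumnHandle` objects and their dereference chains.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: dotted names such as "a.b" stand in for projected Hive column handles.
class SufficientColumnsSketch
{
    // True if `column` is `candidate` itself or a dereference of it.
    // "a.b" is derivable from "a", but "ab" is not.
    static boolean derivableFrom(String column, String candidate)
    {
        return column.equals(candidate) || column.startsWith(candidate + ".");
    }

    // Keeps only the columns that cannot be derived from another requested column.
    static List<String> projectSufficientColumns(List<String> requested)
    {
        Set<String> distinct = new LinkedHashSet<>(requested);
        List<String> sufficient = new ArrayList<>();
        for (String column : distinct) {
            boolean covered = false;
            for (String other : distinct) {
                if (!other.equals(column) && derivableFrom(column, other)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                sufficient.add(column);
            }
        }
        return sufficient;
    }

    public static void main(String[] args)
    {
        // Prints [a, c]: "a.b" is dropped because it can be projected from "a".
        System.out.println(projectSufficientColumns(List.of("a.b", "a", "c")));
    }
}
```

The quadratic scan keeps the sketch short; for the handful of columns in a typical projection that is unlikely to matter.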
This PR includes the following major changes:
1. Add support for projected columns in Hive
Page sources need to be supplied projected columns representing internal fields from a struct column. We modify HiveColumnHandle to include this information about column subfield access. Right now, the HiveColumnProjectionInfo simply stores a chain of dereferences, but that can later be extended to support other types of partial columns (e.g. subscripts).
Internal page sources in the Hive connector may or may not be able to push down projected column reads, depending on the format. Reader page sources or record cursors can provide the partial projections they'll apply, and the rest is "adapted" by the HivePageSource or HiveReaderProjectionsAdaptingRecordCursor.
By default, the readers only read base columns for projected columns. For example, column "a" will be read for a projection "a.x". This is implemented through ReaderProjections#projectBaseColumns.
2. Pushdown for Dereferences in Hive:
The goal is to push down dereference expressions to the Hive connector for both projections and predicates.
As described in "Pushdown of dereference expressions (with focus on Hive connector)" #1953, creating partial column handles in HiveMetadata#applyProjection will give us both predicate and projection pushdown into Hive through iterative optimization.
HiveMetadata#applyProjection extracts the longest simple dereference chains and variables from input expressions and provides new column handles for them. All the column handles returned by this method are unique.
3. Projection and Predicate pushdown of subfields in Columnar Sources:
In the case of columnar formats (e.g. ORC, Parquet), the partial columns can be read without reading entire columns, which is the primary motive here.
We implement projectSufficientColumns for such page sources to leverage, which lets the page source read only the required superset columns. Say we have a column C of type RowType, and the query asks for a set of subcolumns S from column C. The superset columns are then the smallest subset of S from which every column in S can be derived using identity or dereference expressions. For example, when S = {"a.b.c", "a.d", "a.b.c.x"}, the reader will create columns {"a.b.c", "a.d"}. The HivePageSource will adapt for ".x" for the third column.
ORC and Parquet changes are extracted out in different PRs.
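The adaptation step mentioned in item 3 (the reader supplies "a.b.c" and the page source still has to apply ".x") could look something like the sketch below. Everything here is hypothetical and for illustration only: `ProjectionAdaptationSketch`, `Adaptation`, and `adapt` are invented names, and dotted strings again stand in for column handles. It does not reflect the actual HivePageSource code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: maps a requested column onto a reader column plus leftover dereferences.
class ProjectionAdaptationSketch
{
    // Which reader column to start from, and which fields still need to be dereferenced.
    static final class Adaptation
    {
        final int readerColumnIndex;
        final List<String> remainingDereferences;

        Adaptation(int readerColumnIndex, List<String> remainingDereferences)
        {
            this.readerColumnIndex = readerColumnIndex;
            this.remainingDereferences = remainingDereferences;
        }
    }

    // Finds the first reader column that the requested column equals or extends.
    static Optional<Adaptation> adapt(String requested, List<String> readerColumns)
    {
        for (int i = 0; i < readerColumns.size(); i++) {
            String readerColumn = readerColumns.get(i);
            if (requested.equals(readerColumn)) {
                // Identity: the reader already produces exactly this column.
                return Optional.of(new Adaptation(i, Collections.emptyList()));
            }
            if (requested.startsWith(readerColumn + ".")) {
                // The reader produces a prefix; the rest must be applied afterwards.
                String suffix = requested.substring(readerColumn.length() + 1);
                return Optional.of(new Adaptation(i, Arrays.asList(suffix.split("\\."))));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        Adaptation adaptation = adapt("a.b.c.x", Arrays.asList("a.b.c", "a.d")).get();
        // Prints 0 [x]: start from reader column "a.b.c" and dereference field "x".
        System.out.println(adaptation.readerColumnIndex + " " + adaptation.remainingDereferences);
    }
}
```

With the default base-column behaviour from item 1, the same idea applies with reader column "a" and remaining dereferences ["x"] for a projection "a.x".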