Dereference projections and predicate pushdown in Hive#1720

Merged
martint merged 11 commits into trinodb:master from phd3:hive-deref-pushdown
Apr 9, 2020
Member

@phd3 phd3 commented Oct 11, 2019

(required by #1953)

This PR includes the following major changes:

1. Add support for projected columns in Hive

  • Page sources need to be supplied with projected columns representing internal fields of a struct column. We modify HiveColumnHandle to carry this information about column subfield access. Right now, HiveColumnProjectionInfo simply stores a chain of dereferences, but it can later be extended to support other kinds of partial columns (e.g. subscripts).

  • Internal page sources in the Hive connector may or may not be able to push down projected column reads, depending on the file format. Reader page sources or record cursors can report the partial projections they will apply, and the rest is "adapted" by HivePageSource or HiveReaderProjectionsAdaptingRecordCursor.

  • By default, the readers only read base columns for projected columns. For example, column "a" will be read for a projection "a.x". This is implemented through ReaderProjections#projectBaseColumns
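
To illustrate the default base-column behavior described in the last bullet, here is a minimal sketch. It assumes columns are modeled as dotted path strings rather than HiveColumnHandle objects; the class name and method signature are hypothetical and are not the actual ReaderProjections#projectBaseColumns API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: columns are dotted paths, not HiveColumnHandle instances.
class BaseColumnProjectionSketch {
    // Returns the base (top-level) column of a dotted path, e.g. "a.x" -> "a".
    static String baseColumn(String path) {
        int dot = path.indexOf('.');
        return dot < 0 ? path : path.substring(0, dot);
    }

    // Returns the distinct base columns a reader would need, in first-seen order.
    static List<String> projectBaseColumns(List<String> projectedColumns) {
        Set<String> bases = new LinkedHashSet<>();
        for (String column : projectedColumns) {
            bases.add(baseColumn(column));
        }
        return new ArrayList<>(bases);
    }

    public static void main(String[] args) {
        // Projections "a.x" and "a.y.z" both fall back to reading base column "a".
        System.out.println(projectBaseColumns(List.of("a.x", "a.y.z", "b"))); // prints [a, b]
    }
}
```

In this sketch the reader sees only `a` and `b`; producing `a.x` from `a` is then the job of the adapting layer.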

2. Pushdown for Dereferences in Hive:

  • The goal is to push down dereference expressions to the Hive connector for both projections and predicates.

  • As described in Pushdown of dereference expressions (with focus on Hive connector) #1953, creating partial column handles in HiveMetadata#applyProjection will give us both predicate and projection pushdown into Hive through iterative optimization.

  • HiveMetadata#applyProjection extracts the longest simple dereference chains and variables from the input expressions and provides new column handles for them. All the column handles returned by this method are unique.
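
The extraction step in the last bullet can be sketched roughly as follows. This uses a toy expression tree (Variable, Dereference, Call) invented for illustration; it is not the connector expression model from the SPI, and the real method returns column handles rather than strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy expression tree, not the actual SPI expression types.
class DereferenceChainSketch {
    abstract static class Expr {}

    static final class Variable extends Expr {
        final String name;
        Variable(String name) { this.name = name; }
    }

    static final class Dereference extends Expr {
        final Expr target;
        final String field;
        Dereference(Expr target, String field) { this.target = target; this.field = field; }
    }

    static final class Call extends Expr {
        final List<Expr> arguments;
        Call(List<Expr> arguments) { this.arguments = arguments; }
    }

    // Renders a pure chain such as a.b.c, or returns null when the expression
    // is not a chain of dereferences that bottoms out in a variable.
    static String asChain(Expr expr) {
        if (expr instanceof Variable) {
            return ((Variable) expr).name;
        }
        if (expr instanceof Dereference) {
            Dereference dereference = (Dereference) expr;
            String prefix = asChain(dereference.target);
            return prefix == null ? null : prefix + "." + dereference.field;
        }
        return null;
    }

    // Collects the outermost (longest) chains; the set keeps the result unique.
    static void collect(Expr expr, Set<String> chains) {
        String chain = asChain(expr);
        if (chain != null) {
            chains.add(chain);
        }
        else if (expr instanceof Dereference) {
            collect(((Dereference) expr).target, chains);
        }
        else if (expr instanceof Call) {
            for (Expr argument : ((Call) expr).arguments) {
                collect(argument, chains);
            }
        }
    }

    static List<String> extractChains(List<Expr> expressions) {
        Set<String> chains = new LinkedHashSet<>();
        for (Expr expression : expressions) {
            collect(expression, chains);
        }
        return new ArrayList<>(chains);
    }

    public static void main(String[] args) {
        Expr abc = new Dereference(new Dereference(new Variable("a"), "b"), "c");
        Expr call = new Call(List.of(abc, new Variable("d")));
        System.out.println(extractChains(List.of(abc, call))); // prints [a.b.c, d]
    }
}
```

Here `a.b.c` appears twice in the input but yields a single entry, mirroring the uniqueness guarantee on the returned column handles.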

3. Projection and Predicate pushdown of subfields in Columnar Sources:

  • For columnar formats (e.g. ORC, Parquet), partial columns can be read without reading the entire column, which is the primary motivation here.

  • We implement projectSufficientColumns for such page sources, which lets the page source read only the required superset columns. Say we have a column C of type RowType, and the query asks for a set of subcolumns S of C. The superset columns are the smallest subset of S from which every column in S can be derived using identity or dereference expressions. For example, when S = {"a.b.c", "a.d", "a.b.c.x"}, the reader will read the columns {"a.b.c", "a.d"}, and HivePageSource will adapt for ".x" to produce the third column.

  • ORC and Parquet changes are extracted out in different PRs.
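
The "smallest sufficient set" rule from the projectSufficientColumns bullet can be sketched as below. This is a simplified model that assumes subcolumns are dotted path strings and that field names contain no dots; the class name and signature are hypothetical and do not reflect the actual method in this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: subcolumns are dotted paths, not projected column handles.
class SufficientColumnsSketch {
    // True when `prefix` is a strict ancestor of `path` at a field boundary,
    // so "a.b" covers "a.b.c" but does not cover "a.bc".
    static boolean covers(String prefix, String path) {
        return path.startsWith(prefix + ".");
    }

    // Keeps only the requested paths that are not derivable from another
    // requested path by further dereferencing.
    static List<String> projectSufficientColumns(List<String> requested) {
        Set<String> distinct = new LinkedHashSet<>(requested);
        List<String> sufficient = new ArrayList<>();
        for (String candidate : distinct) {
            boolean covered = false;
            for (String other : distinct) {
                if (covers(other, candidate)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                sufficient.add(candidate);
            }
        }
        return sufficient;
    }

    public static void main(String[] args) {
        // "a.b.c.x" is dropped because it can be derived from "a.b.c".
        System.out.println(projectSufficientColumns(List.of("a.b.c", "a.d", "a.b.c.x"))); // prints [a.b.c, a.d]
    }
}
```

The quadratic scan is adequate for the handful of subfields a query typically references; a prefix tree would be the natural choice if the sets were large.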

Member

@martint martint left a comment

A couple of initial comments that will affect the overall approach and scope. I'm reviewing the core logic now.

@phd3 phd3 force-pushed the hive-deref-pushdown branch from 1787d4b to fb292f9 Compare October 24, 2019 00:31
@phd3 phd3 changed the title WIP dereference pushdown Dereference pushdown Oct 24, 2019
@phd3 phd3 changed the title Dereference pushdown Connector Dereference pushdown Oct 24, 2019
Contributor

@wagnermarkd wagnermarkd left a comment

Left a few comments. I don't think the DereferenceAdaption has the desired effect. More details inline.

W.r.t. sequencing, can we handle partial pushdown or unique pushdowns in a separate RB?

@phd3 phd3 force-pushed the hive-deref-pushdown branch from fb292f9 to d6c5c76 Compare October 25, 2019 19:37
@phd3 phd3 changed the title Connector Dereference pushdown Dereference projections pushdown in Hive Oct 25, 2019
Member Author

phd3 commented Oct 25, 2019

@wagnermarkd split the PR into two parts. #1868 deals with the engine side changes.

@phd3 phd3 force-pushed the hive-deref-pushdown branch from d6c5c76 to 92a2dd9 Compare November 9, 2019 02:00
@phd3 phd3 added the WIP label Nov 9, 2019
@phd3 phd3 self-assigned this Nov 9, 2019
@phd3 phd3 force-pushed the hive-deref-pushdown branch 7 times, most recently from d80cbb6 to c84be63 Compare November 13, 2019 20:00
@phd3 phd3 removed the WIP label Nov 13, 2019
Member Author

phd3 commented Nov 13, 2019

@wagnermarkd @martint this PR is ready for review again.

@phd3 phd3 force-pushed the hive-deref-pushdown branch 4 times, most recently from 9725ec7 to 112af55 Compare November 18, 2019 23:20
@phd3 phd3 changed the title Dereference projections pushdown in Hive Dereference projections and predicate pushdown in Hive Nov 19, 2019
Member

@martint martint left a comment

Just a few minor comments.

Member

I think you forgot to address this comment.


/**
* Returns the assignment key corresponding to the column represented by {@param projectedColumn} in the {@param assignments}, if one exists.
* The variable in the {@param projectedColumn} can itself be a representation of another projected column. For example,
Member

Does this comment reflect new behavior? If not, move it to the commit that introduced the method.

@phd3 phd3 force-pushed the hive-deref-pushdown branch 3 times, most recently from 826f84b to cfbbc51 Compare March 26, 2020 06:41
@phd3 phd3 force-pushed the hive-deref-pushdown branch from cfbbc51 to 9769f3a Compare March 29, 2020 23:19
Member Author

phd3 commented Mar 30, 2020

@martint Thanks for the review, addressed your comments. The test failure looks unrelated, and doesn't seem to reproduce locally.

There are two other main changes from the rebase:

  1. Excluded the applyProjection test for the Alluxio metastore, since it involves creating a table.
  2. Resolved conflicts to use the newly introduced TableToPartitionMapping instead of coercions in HivePageSourceProvider::buildColumnMappings.

Member Author

should be the same fields as equals

Member Author

Add a check for column being a base hive column


@phd3 phd3 force-pushed the hive-deref-pushdown branch from 9769f3a to c8eb95f Compare March 31, 2020 23:54
@phd3 phd3 force-pushed the hive-deref-pushdown branch from c8eb95f to 8b7f573 Compare April 1, 2020 06:35
4 participants