
Conversation

@alexeykudinkin
Contributor


What is the purpose of the pull request

Supporting Composite Expressions (including standard Spark functions and UDFs) as Filter Expressions in the Data Skipping flow.

For example, where previously we only supported fairly simple expressions over Data Table attributes such as WHERE columnA = 42, we now support a much broader scope of expressions (defined precisely below), allowing Data Skipping to properly digest queries like WHERE date_format(columnA, 'MM/dd/yyyy') = '01/01/2022' that reference Spark's standard date_format function (a short illustrative sketch follows the rules below).

Formally, any attribute (source-column) expression is now supported, provided that it:

  1. References exactly one attribute (column), meaning that expressions over multiple columns like A * B = 0 are not supported
  2. Does not contain sub-queries

Similarly, any value expression is now supported, provided that it:

  1. Does not reference any other attribute (filters like A = B are not supported)
  2. Does not contain sub-queries
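To make these rules concrete, here is a minimal, hypothetical sketch (the table and column names are made up for illustration; this is not Hudi test code) contrasting filters that the expanded Data Skipping flow can and cannot digest:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("data-skipping-filters").getOrCreate()
import spark.implicits._

// Hypothetical table standing in for a Hudi Data Table
Seq(("2022-01-01 10:00:00", "foo", 1L), ("2022-03-08 12:30:00", "bar", 2L))
  .toDF("columnA", "columnB", "columnC")
  .createOrReplaceTempView("hudi_table")

// Supported: exactly one attribute on the left-hand side, a literal-only expression on the right
spark.sql("SELECT * FROM hudi_table WHERE date_format(columnA, 'MM/dd/yyyy') = '01/01/2022'")
spark.sql("SELECT * FROM hudi_table WHERE lower(columnB) IN ('foo', 'bar')")

// Not eligible for data skipping: references more than one attribute
spark.sql("SELECT * FROM hudi_table WHERE columnC * length(columnB) = 0")

// Not eligible for data skipping: the value expression references another attribute
spark.sql("SELECT * FROM hudi_table WHERE columnB = columnA")
```

Only the first two filters satisfy both rules; the last two remain valid Spark SQL, but Data Skipping would not be able to use them to prune files.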

Brief change log

  • Expanded scope for value expressions
  • Expanded scope for attribute expressions
  • Fixed resolveExpr util to properly resolve Spark functions
  • Grouped together logically equivalent expressions
  • Added tests

Verify this pull request

This change added tests covering the newly supported composite filter expressions.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin changed the title from "[HUDI-512] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" to "[HUDI-512][Stacked on 4948] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" on Mar 9, 2022
@alexeykudinkin changed the title from "[HUDI-512][Stacked on 4948] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" to "[HUDI-3594][Stacked on 4948] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" on Mar 9, 2022
@alexeykudinkin force-pushed the ak/mtmod-idx-2 branch 4 times, most recently from 4e5830d to e4a45b8, on March 15, 2022 00:53
@nsivabalan added the priority:blocker (Production down; release blocker) label on Mar 15, 2022
@alexeykudinkin changed the title from "[HUDI-3594][Stacked on 4948] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" to "[HUDI-3594] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" on Mar 15, 2022
@yihua self-assigned this on Mar 15, 2022
* @param tableSchema table schema encompassing attributes to resolve against
* @return Resolved filter expression
*/
def resolveExpr(spark: SparkSession, exprString: String, tableSchema: StructType): Expression = {
Contributor Author

Did not change
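For readers following along, a minimal sketch of what resolving a filter string against a table schema can look like with plain Spark APIs. This is an illustration only, not Hudi's actual resolveExpr implementation (which, per this thread, relies on package-private Spark APIs); the function name is hypothetical.

```scala
import java.util.Collections

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.Filter
import org.apache.spark.sql.types.StructType

// Attach the filter string to an empty relation carrying the table schema and let Spark's
// analyzer bind both attribute references and function references (e.g. date_format).
def resolveExprSketch(spark: SparkSession, exprString: String, tableSchema: StructType): Expression = {
  val emptyRelation = spark.createDataFrame(Collections.emptyList[Row](), tableSchema)
  emptyRelation
    .filter(exprString) // parses the string and resolves it during analysis
    .queryExecution
    .analyzed
    .collectFirst { case Filter(condition, _) => condition }
    .getOrElse(throw new IllegalArgumentException(s"Failed to resolve expression: $exprString"))
}
```

For example, resolveExprSketch(spark, "date_format(columnA, 'MM/dd/yyyy') = '01/01/2022'", schema) yields a fully resolved Catalyst expression with the date_format function bound.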

* @param partitionColumns The partition columns
* @return (partitionFilters, dataFilters)
*/
def splitPartitionAndDataPredicates(sparkSession: SparkSession,
Contributor Author

Did not change
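Again for context only, a rough sketch (not the actual Hudi implementation; the function name is hypothetical) of how pushed-down predicates are typically split into partition filters and data filters in Spark-based code:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, SubqueryExpression}

// A predicate can be evaluated purely against partition values if it references only
// partition columns and contains no sub-queries; everything else is a data filter.
// NOTE: a real implementation also needs to respect Spark's case-sensitivity resolver.
def splitPartitionAndDataPredicatesSketch(
    predicates: Seq[Expression],
    partitionColumns: Set[String]): (Seq[Expression], Seq[Expression]) = {
  predicates.partition { predicate =>
    predicate.references.forall(attr => partitionColumns.contains(attr.name)) &&
      predicate.find(_.isInstanceOf[SubqueryExpression]).isEmpty
  }
}
```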

Member

@vinothchandar left a comment

The approach seems good to me. Just started reviewing through.

@xiarixiaoyao Would you have some cycles to review as well, since you wrote a lot of the original skipping utils code?

* limitations under the License.
*/

package org.apache.spark.sql
Member

is this code adapted from somewhere? if so, can you please add source attribution

Contributor Author

Nope, this is our code. Had to place it in spark.sql to access package-private API


private object OrderPreservingTransformation {
  def unapply(expr: Expression): Option[AttributeReference] = {
    expr match {
Member

I guess these are the transformations that we whitelist

Contributor Author

Correct
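To make the idea concrete, here is an illustrative, deliberately non-exhaustive sketch of what such a whitelist extractor can look like; the actual set of whitelisted transformations (and the safety conditions attached to them) is defined in the PR's code, and the object name below is hypothetical:

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, DateFormatClass, Expression, FromUnixTime, Literal}

object OrderPreservingTransformationSketch {
  def unapply(expr: Expression): Option[AttributeReference] =
    expr match {
      // The bare attribute is trivially order-preserving
      case attr: AttributeReference => Some(attr)
      // Formatting a timestamp attribute with a pattern whose lexicographic order follows
      // chronological order (e.g. 'yyyy-MM-dd') keeps the ordering of the source values
      case DateFormatClass(OrderPreservingTransformationSketch(attr), Literal(_, _), _) => Some(attr)
      case FromUnixTime(OrderPreservingTransformationSketch(attr), Literal(_, _), _) => Some(attr)
      // Anything not explicitly whitelisted is rejected
      case _ => None
    }
}
```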

//
// This expression will be translated into following Filter Expression for the Column Stats Index:
//
// ```(transform_expr(colA_minValue) <= value_expr) AND (value_expr <= transform_expr(colA_maxValue))```
Member

nts: Let's take an example that parses a timestamp ts column into date using something like date_format.

date_format(ts, ...) = '2022-03-01'

We will simply look for files that have overlap with that date. sgtm

Contributor Author

There's a test testing exactly this use-case
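To visualize that translation for the date_format example above, assuming (hypothetically) that the Column Stats Index exposes ts_minValue and ts_maxValue columns for ts, the rewritten filter could be expressed with Spark's Column API roughly as:

```scala
import org.apache.spark.sql.functions.{col, date_format, lit}

// colA_minValue / colA_maxValue become ts_minValue / ts_maxValue here; the literal side of
// the original predicate is compared against the transformed min/max statistics.
val translatedColumnStatsFilter =
  (date_format(col("ts_minValue"), "yyyy-MM-dd") <= lit("2022-03-01")) &&
    (lit("2022-03-01") <= date_format(col("ts_maxValue"), "yyyy-MM-dd"))
```

A file whose stats make this condition false cannot contain rows matching the original predicate and can therefore be skipped.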

Member

@vinothchandar left a comment

If you can share some test results, we can land this. otherwise lgtm

@alexeykudinkin
Contributor Author

@vinothchandar what test results are you referring to?

// NOTE: That we can apply ```transform_expr``` transformation precisely b/c it preserves the ordering of the
// values of the source column, ie following holds true:
//
// colA_minValue = min(colA) => transform_expr(colA_minValue) = min(transform_expr(colA))
Contributor

appreciate the detailed explanation.
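As a quick self-contained illustration of why this holds (plain Scala, not Hudi code): for an order-preserving transform, applying the transform to the stored min/max is the same as taking the min/max of the transformed values.

```scala
// Example: truncating a fixed-format timestamp string to its date prefix is order-preserving,
// so transform(min(values)) == min(transform(values)), and likewise for max.
val values = Seq("2022-03-08 10:00:00", "2022-03-09 23:59:59", "2022-03-10 00:00:01")
val transform: String => String = _.substring(0, 10)

assert(transform(values.min) == values.map(transform).min)
assert(transform(values.max) == values.map(transform).max)
```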

),
Seq("file_1")),
arguments(
"date_format(C, 'MM/dd/yyyy') IN ('03/08/2022')",
Contributor

do we have support for AND with our data skipping?
date_format(C, 'MM/dd/yyyy') >= '03/08/2022' and date_format(C, 'MM/dd/yyyy') <= '03/10/2022'

I understand the test may not go into this test class. just asking in general, do we have tests elsewhere to cover this case.

Also, can we add a test for matching multiple entries?

date_format(C, 'MM/dd/yyyy') IN ('03/08/2022', '03/09/2022')

Contributor

I reviewed the data skipping class. Looks like IN w/ multiple values is supported.

@alexeykudinkin
Contributor Author

@hudi-bot run azure

@vinothchandar
Member

@nsivabalan we can land this once CI passes

@yihua removed their assignment on Mar 23, 2022
Alexey Kudinkin added 9 commits March 24, 2022 17:39
…xpressions (in lieu of just "literals");

Tidying up
… any other attributes or holding sub-query exprs
…ort arbitrary expressions involving no more than single column;

Extracted common Column Stats expression utils;
…ie expression containing exactly one attribute reference, and no sub-queries)
@hudi-bot
Collaborator

CI report:


@yihua merged commit 8b38dde into apache:master on Mar 25, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
