
Conversation

@alexeykudinkin
Contributor


What is the purpose of the pull request

Supporting Composite Expressions (including standard Spark functions and UDFs) as Filter Expressions in the Data Skipping flow.

For example, where previously we only supported fairly simple expressions over Data Table attributes such as WHERE columnA = 42, we now support a much broader scope of expressions (defined precisely below), allowing Data Skipping to properly digest queries like WHERE date_format(columnA, 'MM/dd/yyyy') = '01/01/2022' that reference Spark's standard date_format function (a short illustrative sketch follows the rules below).

Formally, any attribute (source-column) expression is now supported, provided that it:

  1. References exactly one attribute (column), meaning that expressions over multiple columns like A * B = 0 are not supported
  2. Does not contain sub-queries

Similarly, any value expression is now supported, provided that it:

  1. Does not reference any other attribute (filters like A = B are not supported)
  2. Does not contain sub-queries
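To make these rules concrete, here is a minimal, hypothetical sketch (the table and column names are made up for illustration; this is not Hudi test code) contrasting filters that the expanded Data Skipping flow can and cannot digest:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("data-skipping-filters").getOrCreate()
import spark.implicits._

// Hypothetical table standing in for a Hudi Data Table
Seq(("2022-01-01 10:00:00", "foo", 1L), ("2022-03-08 12:30:00", "bar", 2L))
  .toDF("columnA", "columnB", "columnC")
  .createOrReplaceTempView("hudi_table")

// Supported: exactly one attribute on the left-hand side, a literal-only expression on the right
spark.sql("SELECT * FROM hudi_table WHERE date_format(columnA, 'MM/dd/yyyy') = '01/01/2022'")
spark.sql("SELECT * FROM hudi_table WHERE lower(columnB) IN ('foo', 'bar')")

// Not eligible for data skipping: references more than one attribute
spark.sql("SELECT * FROM hudi_table WHERE columnC * length(columnB) = 0")

// Not eligible for data skipping: the value expression references another attribute
spark.sql("SELECT * FROM hudi_table WHERE columnB = columnA")
```

Only the first two filters satisfy both rules; the last two remain valid Spark SQL, but Data Skipping would not be able to use them to prune files.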

Brief change log

  • Expanded scope for value expressions
  • Expanded scope for attribute expressions
  • Fixed resolveExpr util to properly resolve Spark functions
  • Grouped together logically equivalent expressions
  • Added tests

Verify this pull request

This change added tests covering the newly supported composite filter expressions.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin changed the title from "[HUDI-512] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" to "[HUDI-512][Stacked on 4948] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" on Mar 9, 2022
@alexeykudinkin changed the title from "[HUDI-512][Stacked on 4948] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" to "[HUDI-3594][Stacked on 4948] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" on Mar 9, 2022
@alexeykudinkin force-pushed the ak/mtmod-idx-2 branch 4 times, most recently from 4e5830d to e4a45b8, on March 15, 2022 00:53
@nsivabalan added the priority:blocker (Production down; release blocker) label on Mar 15, 2022
@alexeykudinkin changed the title from "[HUDI-3594][Stacked on 4948] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" to "[HUDI-3594] Supporting Composite Expressions over Data Table Columns in Data Skipping flow" on Mar 15, 2022
@yihua self-assigned this on Mar 15, 2022
* @param tableSchema table schema encompassing attributes to resolve against
* @return Resolved filter expression
*/
def resolveExpr(spark: SparkSession, exprString: String, tableSchema: StructType): Expression = {
Contributor Author

Did not change
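For readers following along, a minimal sketch of what resolving a filter string against a table schema can look like with plain Spark APIs. This is an illustration only, not Hudi's actual resolveExpr implementation (which, per this thread, relies on package-private Spark APIs); the function name is hypothetical.

```scala
import java.util.Collections

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.Filter
import org.apache.spark.sql.types.StructType

// Attach the filter string to an empty relation carrying the table schema and let Spark's
// analyzer bind both attribute references and function references (e.g. date_format).
def resolveExprSketch(spark: SparkSession, exprString: String, tableSchema: StructType): Expression = {
  val emptyRelation = spark.createDataFrame(Collections.emptyList[Row](), tableSchema)
  emptyRelation
    .filter(exprString) // parses the string and resolves it during analysis
    .queryExecution
    .analyzed
    .collectFirst { case Filter(condition, _) => condition }
    .getOrElse(throw new IllegalArgumentException(s"Failed to resolve expression: $exprString"))
}
```

For example, resolveExprSketch(spark, "date_format(columnA, 'MM/dd/yyyy') = '01/01/2022'", schema) yields a fully resolved Catalyst expression with the date_format function bound.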

* @param partitionColumns The partition columns
* @return (partitionFilters, dataFilters)
*/
def splitPartitionAndDataPredicates(sparkSession: SparkSession,
Contributor Author

Did not change
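Again for context only, a rough sketch (not the actual Hudi implementation; the function name is hypothetical) of how pushed-down predicates are typically split into partition filters and data filters in Spark-based code:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, SubqueryExpression}

// A predicate can be evaluated purely against partition values if it references only
// partition columns and contains no sub-queries; everything else is a data filter.
// NOTE: a real implementation also needs to respect Spark's case-sensitivity resolver.
def splitPartitionAndDataPredicatesSketch(
    predicates: Seq[Expression],
    partitionColumns: Set[String]): (Seq[Expression], Seq[Expression]) = {
  predicates.partition { predicate =>
    predicate.references.forall(attr => partitionColumns.contains(attr.name)) &&
      predicate.find(_.isInstanceOf[SubqueryExpression]).isEmpty
  }
}
```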

Member

@vinothchandar left a comment

The approach seems good to me. Just started reviewing through.

@xiarixiaoyao Would you have some cycles to review as well, since you wrote a lot of the original skipping utils code?

* limitations under the License.
*/

package org.apache.spark.sql
Member

is this code adapted from somewhere? if so, can you please add source attribution

Contributor Author

Nope, this is our code. Had to place it in spark.sql to access package-private API


private object OrderPreservingTransformation {
  def unapply(expr: Expression): Option[AttributeReference] = {
    expr match {
Member

I guess these are the transformations that we whitelist

Contributor Author

Correct
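To make the idea concrete, here is an illustrative, deliberately non-exhaustive sketch of what such a whitelist extractor can look like; the actual set of whitelisted transformations (and the safety conditions attached to them) is defined in the PR's code, and the object name below is hypothetical:

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, DateFormatClass, Expression, FromUnixTime, Literal}

object OrderPreservingTransformationSketch {
  def unapply(expr: Expression): Option[AttributeReference] =
    expr match {
      // The bare attribute is trivially order-preserving
      case attr: AttributeReference => Some(attr)
      // Formatting a timestamp attribute with a pattern whose lexicographic order follows
      // chronological order (e.g. 'yyyy-MM-dd') keeps the ordering of the source values
      case DateFormatClass(OrderPreservingTransformationSketch(attr), Literal(_, _), _) => Some(attr)
      case FromUnixTime(OrderPreservingTransformationSketch(attr), Literal(_, _), _) => Some(attr)
      // Anything not explicitly whitelisted is rejected
      case _ => None
    }
}
```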

//
// This expression will be translated into following Filter Expression for the Column Stats Index:
//
// ```(transform_expr(colA_minValue) <= value_expr) AND (value_expr <= transform_expr(colA_maxValue))```
Member

nts: Let's take an example that parses a timestamp ts column into date using something like date_format.

date_format(ts, ...) = '2022-03-01'

We will simply look for files that have overlap with that date. sgtm

Contributor Author

There's a test testing exactly this use-case
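To visualize that translation for the date_format example above, assuming (hypothetically) that the Column Stats Index exposes ts_minValue and ts_maxValue columns for ts, the rewritten filter could be expressed with Spark's Column API roughly as:

```scala
import org.apache.spark.sql.functions.{col, date_format, lit}

// colA_minValue / colA_maxValue become ts_minValue / ts_maxValue here; the literal side of
// the original predicate is compared against the transformed min/max statistics.
val translatedColumnStatsFilter =
  (date_format(col("ts_minValue"), "yyyy-MM-dd") <= lit("2022-03-01")) &&
    (lit("2022-03-01") <= date_format(col("ts_maxValue"), "yyyy-MM-dd"))
```

A file whose stats make this condition false cannot contain rows matching the original predicate and can therefore be skipped.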

Member

@vinothchandar left a comment

If you can share some test results, we can land this. otherwise lgtm

@alexeykudinkin
Contributor Author

@vinothchandar what test results are you referring to?

// NOTE: That we can apply ```transform_expr``` transformation precisely b/c it preserves the ordering of the
// values of the source column, ie following holds true:
//
// colA_minValue = min(colA) => transform_expr(colA_minValue) = min(transform_expr(colA))
Contributor

appreciate the detailed explanation.
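As a quick self-contained illustration of why this holds (plain Scala, not Hudi code): for an order-preserving transform, applying the transform to the stored min/max is the same as taking the min/max of the transformed values.

```scala
// Example: truncating a fixed-format timestamp string to its date prefix is order-preserving,
// so transform(min(values)) == min(transform(values)), and likewise for max.
val values = Seq("2022-03-08 10:00:00", "2022-03-09 23:59:59", "2022-03-10 00:00:01")
val transform: String => String = _.substring(0, 10)

assert(transform(values.min) == values.map(transform).min)
assert(transform(values.max) == values.map(transform).max)
```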

),
Seq("file_1")),
arguments(
"date_format(C, 'MM/dd/yyyy') IN ('03/08/2022')",
Contributor

do we have support for AND with our data skipping?
date_format(C, 'MM/dd/yyyy') >= '03/08/2022' and date_format(C, 'MM/dd/yyyy') <= '03/10/2022'

I understand the test may not go into this test class. just asking in general, do we have tests elsewhere to cover this case.

Also, can we add a test for matching multiple entries?

date_format(C, 'MM/dd/yyyy') IN ('03/08/2022', '03/09/2022')

Contributor

I reviewed the data skipping class. Looks like IN w/ multiple values is supported.

@alexeykudinkin
Contributor Author

@hudi-bot run azure

@vinothchandar
Member

@nsivabalan we can land this once CI passes

@yihua removed their assignment on Mar 23, 2022
Alexey Kudinkin added 9 commits March 24, 2022 17:39
…xpressions (in lieu of just "literals");

Tidying up
… any other attributes or holding sub-query exprs
…ort arbitrary expressions involving no more than single column;

Extracted common Column Stats expression utils;
…ie expression containing exactly one attribute reference, and no sub-queries)
@hudi-bot
Collaborator

CI report:


@yihua merged commit 8b38dde into apache:master on Mar 25, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
