
Conversation

@huberylee (Contributor) commented Mar 8, 2022

What is the purpose of the pull request

Refactor HoodieCommonUtils to make the code more reasonable.

Brief change log

Verify this pull request

This pull request is already covered by existing tests, such as TestHoodieFileIndex and TestCallProcedure.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@huberylee (Contributor, Author)

@alexeykudinkin Could you help me review this PR? Thanks.

@huberylee (Contributor, Author)

@hudi-bot run azure

(2 similar comments followed.)

FileStatusCache.getOrCreate(sparkSession))

// Resolve partition predicates, only conjunctive predicates are supported
val partitionPredicate = DataSkippingUtils.resolveFilterExpr(sparkSession, predicate,
Contributor:

Let's create HoodieCatalystExpressionUtils and move resolveFilterExpr in there.

Contributor (Author):

Done.
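For background, one common way to implement such a resolveFilterExpr helper is to run the analyzer over a synthetic Filter plan carrying the table schema. A minimal sketch, assuming Spark 3.2-era Catalyst APIs (the object name is illustrative, not necessarily the shape the PR landed):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation}
import org.apache.spark.sql.types.StructType

object HoodieCatalystExpressionUtilsSketch {
  // Resolve the attribute references of an unresolved filter expression
  // against `tableSchema` by analyzing a synthetic Filter over an empty relation.
  def resolveFilterExpr(spark: SparkSession, expr: Expression, tableSchema: StructType): Expression = {
    val fakePlan = Filter(expr, LocalRelation(tableSchema.toAttributes))
    spark.sessionState.analyzer.execute(fakePlan)
      .asInstanceOf[Filter]
      .condition
  }
}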


// Get all partitions and prune partition by predicates
val partitionPaths = hoodieFileIndex.getAllCachedPartitionPaths.asScala.toSeq
val prunedPartitions = hoodieFileIndex.prunePartition(partitionPaths, predicates._2)
Contributor:

Instead of invoking pruning here directly, let's encapsulate pruning within the index itself and then expose the following API:

class SparkHoodieTableFileIndex {
  def getPartitionPaths(predicates: Seq[Expression]): Seq[PartitionPath] = {
    // prune internally
  }
}

Contributor (Author):

Done.
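A sketch of the encapsulated shape the reviewer is asking for, with an illustrative stand-in for Hudi's PartitionPath type; the merged signatures may differ:

import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative stand-in for Hudi's PartitionPath.
case class PartitionPath(path: String)

class SparkHoodieTableFileIndex(allCachedPartitionPaths: Seq[PartitionPath]) {

  // Callers hand over predicates; all pruning happens inside the index,
  // so raw partition listings never escape.
  def getPartitionPaths(predicates: Seq[Expression]): Seq[PartitionPath] =
    prunePartition(allCachedPartitionPaths, predicates)

  // Kept private, per the later review comment on prunePartition.
  private def prunePartition(partitions: Seq[PartitionPath],
                             predicates: Seq[Expression]): Seq[PartitionPath] =
    partitions // placeholder: evaluate predicates against partition values
}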

// Resolve partition predicates, only conjunctive predicates are supported
val partitionPredicate = DataSkippingUtils.resolveFilterExpr(sparkSession, predicate,
  hoodieFileIndex.partitionSchema)
val predicates = splitPredicates(partitionPredicate)
Contributor:

I see that we're splitting, but we don't separate between data filters and partition filters.

Contributor:

And there's actually a util doing what you need: splitPartitionAndDataPredicates.

Contributor (Author):

Now using the HoodieDataSourceHelper#splitPartitionAndDataPredicates method to split data filters and partition filters, and added more tests.
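For intuition, this split puts a predicate on the partition side iff every column it references is a partition column. A hypothetical re-implementation (the real HoodieDataSourceHelper signature may differ):

import org.apache.spark.sql.catalyst.expressions.Expression

object PredicateSplitSketch {
  // Returns (partitionPredicates, dataPredicates).
  def splitPartitionAndDataPredicates(predicates: Seq[Expression],
                                      partitionColumns: Set[String]): (Seq[Expression], Seq[Expression]) =
    predicates.partition(p => p.references.forall(attr => partitionColumns.contains(attr.name)))
}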

* @return The pair of disjunctive predicates and conjunctive predicates
*/
private def splitPredicates(condition: Expression): (Seq[Expression], Seq[Expression]) = {
  condition match {
Contributor:

Please check my comment above regarding splitting.
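For reference, the conjunctive-only split that the snippet above begins is conventionally written the way Spark's PredicateHelper#splitConjunctivePredicates does it, by flattening the And tree:

import org.apache.spark.sql.catalyst.expressions.{And, Expression}

object ConjunctSplitSketch {
  // (a AND b) AND c  =>  Seq(a, b, c)
  def splitConjunctivePredicates(condition: Expression): Seq[Expression] = condition match {
    case And(left, right) =>
      splitConjunctivePredicates(left) ++ splitConjunctivePredicates(right)
    case other => Seq(other)
  }
}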

protected def parsePartitionColumnValues(partitionColumns: Array[String], partitionPath: String): Array[Object] = {
HoodieCommonUtils.parsePartitionColumnValues(sparkParsePartitionUtil, configProperties,
basePath, partitionSchema, partitionColumns, partitionPath)
if (partitionColumns.length == 0) {
Contributor:

You're just moving this code back, right? There are no additional changes, are there?

Contributor (Author):

Yes, I'm just moving this code back with no additional changes.

* @param predicates The filter condition.
* @return The Pruned partition paths.
*/
def prunePartition(partitionPaths: Seq[PartitionPath], predicates: Seq[Expression]): Seq[PartitionPath] = {
Contributor:

Please check my comment above, and let's make this method private.
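A minimal sketch of the pruning core itself, assuming Spark 3.x's Predicate.create to bind predicates to the partition schema and evaluate them per partition (the Hudi method works on PartitionPath objects rather than raw rows):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, Predicate}

object PartitionPruneSketch {
  // Keep only partitions whose values satisfy all partition predicates.
  def prunePartition(partitionSchema: Seq[Attribute],
                     partitions: Seq[(String, InternalRow)], // (path, partition values)
                     predicates: Seq[Expression]): Seq[(String, InternalRow)] =
    if (predicates.isEmpty) partitions
    else {
      val bound = Predicate.create(predicates.reduce(And(_, _)), partitionSchema)
      partitions.filter { case (_, values) => bound.eval(values) }
    }
}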

* under the License.
*/

package org.apache.spark.sql.hudi
Contributor:

Please move it to org.apache.spark.sql (so that we can access some package-private APIs).
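For context on why the package matters (a general visibility note, not a claim about which specific Spark APIs the PR needed): Spark guards a number of helpers with private[sql]-style qualifiers, so code compiled outside the org.apache.spark.sql hierarchy, such as org.apache.hudi, cannot see them. Declaring the utility inside that package restores access:

// Compiling into org.apache.spark.sql lets this object use Spark members
// restricted to this package; plain org.apache.hudi code cannot see them.
package org.apache.spark.sql

object HoodieCatalystExpressionUtils {
  // helpers such as resolveFilterExpr would live here
}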

@alexeykudinkin (Contributor)

@huberylee thanks for cleaning things up!

@huberylee (Contributor, Author)

@hudi-bot run azure

(1 similar comment followed.)

@hudi-bot (Collaborator)

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@huberylee (Contributor, Author)

> @huberylee thanks for cleaning things up!

@alexeykudinkin All comments have been addressed.

@yihua merged commit 56cb494 into apache:master Mar 11, 2022