
Conversation

@huberylee (Contributor) commented Mar 8, 2022

What is the purpose of the pull request

Refactor HoodieCommonUtils to make the code more reasonable.

Brief change log

Verify this pull request

This pull request is already covered by existing tests, such as TestHoodieFileIndex and TestCallProcedure.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@huberylee (Contributor, Author)

@alexeykudinkin Could you help me review this PR? Thanks.

@huberylee (Contributor, Author)

@hudi-bot run azure

(2 similar comments followed.)

FileStatusCache.getOrCreate(sparkSession))

// Resolve partition predicates, only conjunctive predicates are supported
val partitionPredicate = DataSkippingUtils.resolveFilterExpr(sparkSession, predicate,
Contributor:

Let's create HoodieCatalystExpressionUtils and move resolveFilterExpr in there.

Contributor (Author):

Done.
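For background, one common way to implement such a resolveFilterExpr helper is to run the analyzer over a synthetic Filter plan carrying the table schema. A minimal sketch, assuming Spark 3.2-era Catalyst APIs (the object name is illustrative, not necessarily the shape the PR landed):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation}
import org.apache.spark.sql.types.StructType

object HoodieCatalystExpressionUtilsSketch {
  // Resolve the attribute references of an unresolved filter expression
  // against `tableSchema` by analyzing a synthetic Filter over an empty relation.
  def resolveFilterExpr(spark: SparkSession, expr: Expression, tableSchema: StructType): Expression = {
    val fakePlan = Filter(expr, LocalRelation(tableSchema.toAttributes))
    spark.sessionState.analyzer.execute(fakePlan)
      .asInstanceOf[Filter]
      .condition
  }
}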


// Get all partitions and prune partition by predicates
val partitionPaths = hoodieFileIndex.getAllCachedPartitionPaths.asScala.toSeq
val prunedPartitions = hoodieFileIndex.prunePartition(partitionPaths, predicates._2)
Contributor:

Instead of invoking pruning here directly, let's encapsulate pruning within the index itself and then expose the following API:

class SparkHoodieTableFileIndex {
  def getPartitionPaths(predicates: Seq[Expression]): Seq[PartitionPath] = {
    // prune internally
  }
}

Contributor (Author):

Done.
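A sketch of the encapsulated shape the reviewer is asking for, with an illustrative stand-in for Hudi's PartitionPath type; the merged signatures may differ:

import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative stand-in for Hudi's PartitionPath.
case class PartitionPath(path: String)

class SparkHoodieTableFileIndex(allCachedPartitionPaths: Seq[PartitionPath]) {

  // Callers hand over predicates; all pruning happens inside the index,
  // so raw partition listings never escape.
  def getPartitionPaths(predicates: Seq[Expression]): Seq[PartitionPath] =
    prunePartition(allCachedPartitionPaths, predicates)

  // Kept private, per the later review comment on prunePartition.
  private def prunePartition(partitions: Seq[PartitionPath],
                             predicates: Seq[Expression]): Seq[PartitionPath] =
    partitions // placeholder: evaluate predicates against partition values
}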

// Resolve partition predicates, only conjunctive predicates are supported
val partitionPredicate = DataSkippingUtils.resolveFilterExpr(sparkSession, predicate,
  hoodieFileIndex.partitionSchema)
val predicates = splitPredicates(partitionPredicate)
Contributor:

I see that we're splitting, but we don't separate between data filters and partition filters.

Contributor:

And there's actually a util doing what you need: splitPartitionAndDataPredicates.

Contributor (Author):

Now using the HoodieDataSourceHelper#splitPartitionAndDataPredicates method to split data filters and partition filters, and added more tests.
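For intuition, this split puts a predicate on the partition side iff every column it references is a partition column. A hypothetical re-implementation (the real HoodieDataSourceHelper signature may differ):

import org.apache.spark.sql.catalyst.expressions.Expression

object PredicateSplitSketch {
  // Returns (partitionPredicates, dataPredicates).
  def splitPartitionAndDataPredicates(predicates: Seq[Expression],
                                      partitionColumns: Set[String]): (Seq[Expression], Seq[Expression]) =
    predicates.partition(p => p.references.forall(attr => partitionColumns.contains(attr.name)))
}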

* @return The pair of disjunctive predicates and conjunctive predicates
*/
private def splitPredicates(condition: Expression): (Seq[Expression], Seq[Expression]) = {
  condition match {
Contributor:

Please check my comment above regarding splitting.
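For reference, the conjunctive-only split that the snippet above begins is conventionally written the way Spark's PredicateHelper#splitConjunctivePredicates does it, by flattening the And tree:

import org.apache.spark.sql.catalyst.expressions.{And, Expression}

object ConjunctSplitSketch {
  // (a AND b) AND c  =>  Seq(a, b, c)
  def splitConjunctivePredicates(condition: Expression): Seq[Expression] = condition match {
    case And(left, right) =>
      splitConjunctivePredicates(left) ++ splitConjunctivePredicates(right)
    case other => Seq(other)
  }
}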

protected def parsePartitionColumnValues(partitionColumns: Array[String], partitionPath: String): Array[Object] = {
HoodieCommonUtils.parsePartitionColumnValues(sparkParsePartitionUtil, configProperties,
basePath, partitionSchema, partitionColumns, partitionPath)
if (partitionColumns.length == 0) {
Contributor:

You're just moving this code back, right? There are no additional changes, are there?

Contributor (Author):

Yes, I'm just moving this code back with no additional changes.

* @param predicates The filter condition.
* @return The Pruned partition paths.
*/
def prunePartition(partitionPaths: Seq[PartitionPath], predicates: Seq[Expression]): Seq[PartitionPath] = {
Contributor:

Please check my comment above, and let's make this method private.
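A minimal sketch of the pruning core itself, assuming Spark 3.x's Predicate.create to bind predicates to the partition schema and evaluate them per partition (the Hudi method works on PartitionPath objects rather than raw rows):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, Predicate}

object PartitionPruneSketch {
  // Keep only partitions whose values satisfy all partition predicates.
  def prunePartition(partitionSchema: Seq[Attribute],
                     partitions: Seq[(String, InternalRow)], // (path, partition values)
                     predicates: Seq[Expression]): Seq[(String, InternalRow)] =
    if (predicates.isEmpty) partitions
    else {
      val bound = Predicate.create(predicates.reduce(And(_, _)), partitionSchema)
      partitions.filter { case (_, values) => bound.eval(values) }
    }
}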

* under the License.
*/

package org.apache.spark.sql.hudi
Contributor:

Please move it to org.apache.spark.sql (so that we can access some package-private APIs).
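For context on why the package matters (a general visibility note, not a claim about which specific Spark APIs the PR needed): Spark guards a number of helpers with private[sql]-style qualifiers, so code compiled outside the org.apache.spark.sql hierarchy, such as org.apache.hudi, cannot see them. Declaring the utility inside that package restores access:

// Compiling into org.apache.spark.sql lets this object use Spark members
// restricted to this package; plain org.apache.hudi code cannot see them.
package org.apache.spark.sql

object HoodieCatalystExpressionUtils {
  // helpers such as resolveFilterExpr would live here
}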

@alexeykudinkin (Contributor)

@huberylee thanks for cleaning things up!

@huberylee (Contributor, Author)

@hudi-bot run azure

(1 similar comment followed.)

@hudi-bot (Collaborator)

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@huberylee (Contributor, Author)

> @huberylee thanks for cleaning things up!

@alexeykudinkin All comments have been addressed.

@yihua merged commit 56cb494 into apache:master Mar 11, 2022