[HUDI-3594] Supporting Composite Expressions over Data Table Columns in Data Skipping flow #4996
Merged
Commits (37, showing changes from all commits)
- `b3337d7` Revisited Data Skipping utils to accept broader scope of "foldable" e…
- `c7a83ea` Expanded scope even further to include any expression not referencing…
- `02faeea` Refactor Column Stats Index filter expression translation seq to supp…
- `8d4ee26` Added test applying Spark standard functions to source Data Table col…
- `a69ddbc` Grouped together logically equivalent expressions
- `e1a22bb` Generalize all DS patterns to accept "Single Attribute Expressions" (…
- `2ff9d9a` Added composite expression tests
- `aa5a54d` Added test for non-literal value expression;
- `b388434` Added tests for `like`, `not like` operators
- `b32ee08` Tightened up permitted transformation expression to only accept ones …
- `3a5c1c9` Rebased allowed transformation matching seq to match permitted transf…
- `eea06bf` Extracted Expression utils to `HoodieCatalystExpressionUtils`
- `59eef5b` Worked around bug in Spark not allowing to resolve expressions in a s…
- `3f67769` Added tests for composite expression (w/ nested function calls)
- `7cfc6ce` Added `HoodieSparkTypeUtils`;
- `f5d2213` Simplify expression resolution considerably
- `f34df69` Fixing incorrect casting
- `e7a2291` Tidying up java-docs
- `091a357` Fixing compilation
- `43d890a` Tidying up
- `c4ffcc2` Adding explicit type (Scala 2.11 not able to deduce it)
- `b0881ad` Tidying up
- `2394573` Scaffolded `HoodieCatalystExpressionUtils` as Spark-specific object;
- `26936a5` Bootstrapped Spark2 & Spark3 specific `HoodieCatalystExpressionUtils`
- `0c2b88c` Fixing refs
- `5572f82` Missing license
- `94e96fb` Rebasing refs in `DataSkippingUtils`
- `074ac87` Missing imports
- `9dd04d4` Fixing refs
- `0f0d114` Tidying up
- `d72cae3` Inlined `swapAttributeRefInExpr` util
- `468d7fc` Branched out `HoodieSpark3_2CatalystExpressionUtils` to support Spark…
- `54a72dc` `HoodieSpark3CatalystExpressionUtils` > `HoodieSpark3_1CatalystExpres…
- `fe59311` Rebased `Spark3Adapter` to become `BaseSpark3Adapter`;
- `b06fe0d` Dangling ref
- `d320e14` Fixed `ColumnStatsIndexHelper` handling min/max values from `HoodieCo…
- `5bdb3ee` Tidying up
127 additions, 0 deletions: `...hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystExpressionUtils.scala`
```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql

import org.apache.spark.sql.catalyst.analysis.{UnresolvedAttribute, UnresolvedFunction}
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, SubqueryExpression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation, LogicalPlan}
import org.apache.spark.sql.types.StructType

trait HoodieCatalystExpressionUtils {

  /**
   * Parses and resolves the expression against the attributes of the given table schema.
   *
   * For example:
   * <pre>
   * ts > 1000 and ts <= 1500
   * </pre>
   * will be resolved as
   * <pre>
   * And(GreaterThan(ts#590L, 1000), LessThanOrEqual(ts#590L, 1500))
   * </pre>
   *
   * where <pre>ts</pre> is a column of the provided [[tableSchema]]
   *
   * @param spark       spark session
   * @param exprString  string representation of the expression to parse and resolve
   * @param tableSchema table schema encompassing attributes to resolve against
   * @return resolved filter expression
   */
  def resolveExpr(spark: SparkSession, exprString: String, tableSchema: StructType): Expression = {
    val expr = spark.sessionState.sqlParser.parseExpression(exprString)
    resolveExpr(spark, expr, tableSchema)
  }

  /**
   * Resolves the provided expression (unless already resolved) against the attributes of the
   * given table schema.
   *
   * For example:
   * <pre>
   * ts > 1000 and ts <= 1500
   * </pre>
   * will be resolved as
   * <pre>
   * And(GreaterThan(ts#590L, 1000), LessThanOrEqual(ts#590L, 1500))
   * </pre>
   *
   * where <pre>ts</pre> is a column of the provided [[tableSchema]]
   *
   * @param spark       spark session
   * @param expr        Catalyst expression to be resolved (if not yet)
   * @param tableSchema table schema encompassing attributes to resolve against
   * @return resolved filter expression
   */
  def resolveExpr(spark: SparkSession, expr: Expression, tableSchema: StructType): Expression = {
    val analyzer = spark.sessionState.analyzer
    val schemaFields = tableSchema.fields

    val resolvedExpr = {
      val plan: LogicalPlan = Filter(expr, LocalRelation(schemaFields.head, schemaFields.drop(1): _*))
      analyzer.execute(plan).asInstanceOf[Filter].condition
    }

    if (!hasUnresolvedRefs(resolvedExpr)) {
      resolvedExpr
    } else {
      throw new IllegalStateException("unresolved attribute")
    }
  }

  /**
   * Splits the given predicates into two sequences of predicates:
   *
   *   - predicates that reference partition columns only (and involve no sub-query);
   *   - all other predicates.
   *
   * @param sparkSession     the spark session
   * @param predicates       the predicates to be split
   * @param partitionColumns the partition columns
   * @return (partitionFilters, dataFilters)
   */
  def splitPartitionAndDataPredicates(sparkSession: SparkSession,
                                      predicates: Array[Expression],
                                      partitionColumns: Array[String]): (Array[Expression], Array[Expression]) = {
    // Validates that the provided names both resolve to the same entity
    val resolvedNameEquals = sparkSession.sessionState.analyzer.resolver

    predicates.partition(expr => {
      // Checks whether the given expression only references partition columns
      // (and involves no sub-query)
      expr.references.forall(r => partitionColumns.exists(resolvedNameEquals(r.name, _))) &&
        !SubqueryExpression.hasSubquery(expr)
    })
  }

  /**
   * Matches an expression iff
   *
   * <ol>
   *   <li>It references exactly one [[AttributeReference]]</li>
   *   <li>It contains only whitelisted transformations that preserve the ordering of the source column [1]</li>
   * </ol>
   *
   * [1] Ordering preservation is defined as follows: a transformation T is ordering-preserving if,
   * whenever the values of the source column A are ordered a1, a2, a3, ..., the values of the column
   * B = T(A) keep the same ordering b1, b2, b3, ..., with b1 = T(a1), b2 = T(a2), ...
   */
  def tryMatchAttributeOrderingPreservingTransformation(expr: Expression): Option[AttributeReference]

  private def hasUnresolvedRefs(resolvedExpr: Expression): Boolean =
    resolvedExpr.collectFirst {
      case _: UnresolvedAttribute | _: UnresolvedFunction => true
    }.isDefined
}
```
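To make the contract above concrete, here is a minimal usage sketch. It is not part of the PR: the demo object, the toy schema, and the anonymous trait implementation are all made up for illustration. In particular, the toy `tryMatchAttributeOrderingPreservingTransformation` only recognizes a bare column reference or a CAST of one, whereas the PR's Spark-version-specific implementations whitelist a broader set of transformations.

```scala
import org.apache.spark.sql.{HoodieCatalystExpressionUtils, SparkSession}
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, Expression}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical demo, not part of the PR
object HoodieExprUtilsDemo extends App {
  val spark = SparkSession.builder().master("local[1]").appName("expr-utils-demo").getOrCreate()

  // Toy implementation of the trait's abstract member: only a bare column
  // reference, or a CAST of one, is treated as ordering-preserving here
  val utils = new HoodieCatalystExpressionUtils {
    override def tryMatchAttributeOrderingPreservingTransformation(expr: Expression): Option[AttributeReference] =
      expr match {
        case attr: AttributeReference => Some(attr)
        case cast: Cast =>
          cast.child match {
            case attr: AttributeReference => Some(attr)
            case _ => None
          }
        case _ => None
      }
  }

  val tableSchema = StructType(Seq(
    StructField("ts", LongType),
    StructField("symbol", StringType)
  ))

  // Parses the string and resolves `ts` against the schema, yielding roughly
  // And(GreaterThan(ts#0L, 1000), LessThanOrEqual(ts#0L, 1500))
  val resolved = utils.resolveExpr(spark, "ts > 1000 and ts <= 1500", tableSchema)
  println(resolved)

  // Splits resolved predicates into partition-only filters vs. data filters;
  // here `ts` is not a partition column, so the predicate lands in dataFilters
  val (partitionFilters, dataFilters) =
    utils.splitPartitionAndDataPredicates(spark, Array(resolved), Array("symbol"))
  println(s"partition filters: ${partitionFilters.length}, data filters: ${dataFilters.length}")

  spark.stop()
}
```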
Reviewer: is this code adapted from somewhere? if so, can you please add source attribution

Author: Nope, this is our code. Had to place it in `spark.sql` to access package-private API.
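For context on that reply, a minimal sketch of the trick (the object name is hypothetical): in Spark, `SparkSession.sessionState` is scoped `private[sql]`, so it is only callable from code compiled under the `org.apache.spark.sql` package, which is exactly where `HoodieCatalystExpressionUtils` lives.

```scala
// Hypothetical sketch: a file declared under Spark's own package namespace
// gains access to members Spark scopes as `private[sql]`
package org.apache.spark.sql

import org.apache.spark.sql.catalyst.expressions.Expression

object PackagePrivateAccessSketch {

  // `SparkSession.sessionState` is `private[sql]`, so this call compiles only
  // because this file lives in the org.apache.spark.sql package
  def parseExpression(spark: SparkSession, sqlText: String): Expression =
    spark.sessionState.sqlParser.parseExpression(sqlText)
}
```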