[SPARK-32307][SQL] ScalaUDF's canonicalized expression should exclude inputEncoders #29106
Conversation
dongjoon-hyun
left a comment
Hi, @Ngone51 .
The JIRA description and this PR description are misleading, because the issue is not reproducible in vanilla Apache Spark. Could you be more precise about the contribution of this PR?
```scala
scala> spark.version
res0: String = 3.0.0

scala> spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
res1: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$1954/1561881364@8937f62,IntegerType,List(Some(class[value[0]: map<string,string>])),None,false,true)

scala> Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")

scala> sql("SELECT key(a) AS k FROM t GROUP BY key(a)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([1])
```
If this exists on
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @Ngone51 and @cloud-fan . I revised the JIRA and the PR description here.
Merged to master/3.0.
… inputEncoders
### What changes were proposed in this pull request?
Override `canonicalized` to empty the `inputEncoders` for the canonicalized `ScalaUDF`.
### Why are the changes needed?
The following fails on `branch-3.0` currently, but not on the Apache Spark 3.0.0 release.
```scala
spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)
[info] org.apache.spark.sql.AnalysisException: expression 't.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
[info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8]
[info] +- SubqueryAlias t
[info] +- Project [value#3 AS a#6]
[info] +- LocalRelation [value#3]
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
[info] at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[info] at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
[info] at scala.collection.immutable.List.foreach(List.scala:392)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
...
```
We use the rule `ResolveEncodersInUDF` to resolve `inputEncoders`, and at the end the original `ScalaUDF` instance is replaced by a new `ScalaUDF` instance with the resolved encoders. Note that during encoder resolution, types like `map` and `array` are resolved to new expressions (e.g. `MapObjects`, `CatalystToExternalMap`).
However, `ExpressionEncoder` can't be canonicalized. Thus, the canonicalized `ScalaUDF`s become different even when their original `ScalaUDF`s are the same. As a result, `checkValidAggregateExpression` fails when such a `ScalaUDF` is used as a grouping expression.
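The core of the bug can be illustrated with a minimal sketch (the `Udf` class below is a hypothetical stand-in, not Spark's real `ScalaUDF`): two structurally identical instances carrying freshly resolved, non-comparable encoder objects no longer compare equal, and dropping the encoder field from the canonical form restores the equality the analyzer relies on.

```scala
// Hypothetical stand-in for ScalaUDF: `inputEncoders` plays the role of
// the per-instance ExpressionEncoder list that cannot be canonicalized.
case class Udf(children: Seq[String], inputEncoders: Seq[AnyRef]) {
  // Mirrors the fix: the canonical form simply drops the encoders.
  def canonicalized: Udf = copy(inputEncoders = Nil)
}

// Two "resolved" UDFs over the same child: logically the same expression,
// but each carries a distinct encoder instance after resolution.
val a = Udf(Seq("a#6"), Seq(new Object))
val b = Udf(Seq("a#6"), Seq(new Object))

assert(a != b)                             // distinct encoder objects break equality
assert(a.canonicalized == b.canonicalized) // excluding them restores it
```

This is why the GROUP BY check passes after the fix: the grouping expression and the aggregate output compare equal once both are canonicalized without encoders.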
### Does this PR introduce _any_ user-facing change?
Yes, users will not hit the exception after this fix.
### How was this patch tested?
Added tests.
Closes #29106 from Ngone51/spark-32307.
Authored-by: yi.wu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a47b69a)
Signed-off-by: Dongjoon Hyun <[email protected]>
BTW, @cloud-fan and @HyukjinKwon. I merged this because
Test build #125846 has finished for PR 29106 at commit
Late LGTM, thanks, @Ngone51
thanks all!!
Hi, @Ngone51. Could you make a backporting PR for branch-3.0 again?
Has it been merged to 3.0 or not?
It has been reverted at 4ef535f.
This PR fixes an issue that was introduced by SPARK-31826. So, we may not need to backport it to
OK, then let's not backport. @Ngone51 can you update the PR description to make it clear?
Sure
Thank you for the clarification!
```scala
override lazy val canonicalized: Expression = {
  // SPARK-32307: `ExpressionEncoder` can't be canonicalized, and technically we don't
  // need it to identify a `ScalaUDF`.
  Canonicalize.execute(copy(children = children.map(_.canonicalized), inputEncoders = Nil))
}
```
@Ngone51 shall we do the same for `outputEncoder`?
Makes sense. I'll do a follow-up.
…preCanonicalized

### What changes were proposed in this pull request?
This PR proposes to set `outputEncoder` to `None` for `ScalaUDF.preCanonicalized`.

### Why are the changes needed?
We once did the same thing to `inputEncoders` in #29106 to fix a bug where the canonicalized `ScalaUDF`s for the same `ScalaUDF` become different after resolving `inputEncoders`. So this PR applies the same fix to `outputEncoder` to avoid hitting the same issue in the future. Note that we don't have the issue caused by `outputEncoder` now, since we don't resolve `outputEncoder` yet.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass existing tests.

Closes #34937 from Ngone51/SPARK-32307-followup.

Authored-by: yi.wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
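The follow-up extends the same pattern to the output side. As a rough sketch (simplified types, hypothetical `Udf` class rather than Spark's real `ScalaUDF`), the pre-canonicalized form drops both encoder fields, since neither is needed to identify the UDF:

```scala
// Simplified stand-in for ScalaUDF: neither encoder field participates
// in identifying the UDF, so the canonical form drops both.
case class Udf(
    children: Seq[String],
    inputEncoders: Seq[AnyRef],
    outputEncoder: Option[AnyRef]) {
  def preCanonicalized: Udf = copy(inputEncoders = Nil, outputEncoder = None)
}

val x = Udf(Seq("a#6"), Seq(new Object), Some(new Object))
val y = Udf(Seq("a#6"), Seq(new Object), Some(new Object))

assert(x != y)                                   // distinct encoder instances
assert(x.preCanonicalized == y.preCanonicalized) // equal once both are dropped
```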