[SPARK-32344][SQL] Unevaluable expr is set to FIRST/LAST ignoreNullsExpr in distinct aggregates #29143

maropu · 2020-07-17T07:22:48Z

What changes were proposed in this pull request?

This PR fixes a bug in distinct FIRST/LAST aggregates in v2.4.6/v3.0.0/master:

scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show()
...
Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: false#37
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
  at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268)

The root cause of this bug is that the Aggregation strategy replaces the foldable boolean ignoreNullsExpr with an Unevaluable expression (AttributeReference) for distinct FIRST/LAST aggregate functions. This replacement must not happen, because the Analyzer has already checked that the expression is foldable:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala

Lines 74 to 76 in ffdbbae

    
    } else if (!ignoreNullsExpr.foldable) {
      TypeCheckFailure(
        s"The second argument of First must be a boolean literal, but got: ${ignoreNullsExpr.sql}")

So, this PR proposes to change the type of the variable for IGNORE NULLS from Expression to Boolean to avoid this case.
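
To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use the real Catalyst classes: Expr, Lit, AttrRef, FirstWithExpr and FirstWithFlag are hypothetical stand-ins, and withChildren only imitates the kind of child rewrite a planner performs. It is meant to illustrate the idea, not to reproduce the actual planner code.

```scala
object RootCauseSketch {
  // Hypothetical stand-in for a Catalyst expression.
  sealed trait Expr {
    def foldable: Boolean
    def eval(): Any
  }

  // A constant that can be evaluated without any input row.
  final case class Lit(value: Any) extends Expr {
    def foldable: Boolean = true
    def eval(): Any = value
  }

  // Imitates an unevaluable attribute: it only names a column.
  final case class AttrRef(name: String) extends Expr {
    def foldable: Boolean = false
    def eval(): Any =
      throw new UnsupportedOperationException(s"Cannot evaluate expression: $name")
  }

  // Before the fix: the flag is a child expression, so a rewrite of the
  // children can replace the literal with something that cannot be evaluated.
  final case class FirstWithExpr(child: Expr, ignoreNullsExpr: Expr) {
    def ignoreNulls: Boolean = ignoreNullsExpr.eval().asInstanceOf[Boolean]
    def withChildren(newChildren: Seq[Expr]): FirstWithExpr =
      copy(child = newChildren(0), ignoreNullsExpr = newChildren(1))
  }

  // After the fix: the flag is a plain Boolean and is not part of the children,
  // so rewriting the children cannot touch it.
  final case class FirstWithFlag(child: Expr, ignoreNulls: Boolean) {
    def withChildren(newChildren: Seq[Expr]): FirstWithFlag =
      copy(child = newChildren.head)
  }
}
```

In this sketch, calling ignoreNulls on a FirstWithExpr whose second child has been replaced by AttrRef("false#37") throws UnsupportedOperationException, while FirstWithFlag keeps its flag across the same kind of rewrite.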

Why are the changes needed?

Bugfix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a test in DataFrameAggregateSuite.

geektcp

nice

SparkQA · 2020-07-17T12:14:27Z

Test build #126038 has finished for PR 29143 at commit 0e30a25.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-07-17T12:17:13Z

cc: @cloud-fan @viirya @dongjoon-hyun

cloud-fan · 2020-07-17T14:10:13Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

            // [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is disallowed because those two distinct
            // aggregates have different column expressions.
            val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
+              .filterNot(_.foldable)


I tried select count(distinct 1) from v group by id and it works fine, so I think foldable is not the real problem here. I think the root cause is that FIRST/LAST put ignoreNulls in their children, while the children are supposed to be the function inputs.

How about this fix

case class First(child: Expression, ignoreNulls: Boolean)
  extends DeclarativeAggregate with ExpectsInputTypes {
  def this(child: Expression) = this(child, false)
  def this(child: Expression, ignoreNullsExpr: Expression) = {
    this(child, First.validateIgnoreNullExpr(ignoreNullsExpr)) // follow HyperLogLogPlusPlus.validateDoubleLiteral
  }

Ah, looks okay. I'll update.
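
The constructor pattern suggested above can be tried outside Spark with a small sketch. Everything here is a hypothetical stand-in (Expr, Column, BoolLiteral, FirstSketch), and IllegalArgumentException is used in place of Spark's AnalysisException. The point it tries to show is that the flag is resolved once, at construction time, and that children contains only the function input.

```scala
object ChildrenSketch {
  // Hypothetical stand-in for a Catalyst expression tree node.
  sealed trait Expr { def children: Seq[Expr] }
  final case class Column(name: String) extends Expr { def children: Seq[Expr] = Nil }
  final case class BoolLiteral(value: Boolean) extends Expr { def children: Seq[Expr] = Nil }

  final case class FirstSketch(child: Expr, ignoreNulls: Boolean) extends Expr {
    // Default: do not ignore nulls.
    def this(child: Expr) = this(child, false)

    // SQL-facing constructor: the flag arrives as an expression and is
    // converted to a Boolean immediately, or rejected.
    def this(child: Expr, ignoreNullsExpr: Expr) =
      this(child, FirstSketch.validateIgnoreNullExpr(ignoreNullsExpr))

    // Only the function input is a child; the flag is not.
    def children: Seq[Expr] = child :: Nil
  }

  object FirstSketch {
    def validateIgnoreNullExpr(exp: Expr): Boolean = exp match {
      case BoolLiteral(b) => b
      case _ =>
        throw new IllegalArgumentException("The second argument should be a boolean literal.")
    }
  }
}
```

Note that the auxiliary constructors have to be called with new, since the generated apply method only covers the primary constructor.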

cloud-fan · 2020-07-17T14:55:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala

+object FirstLast {
+  def validateIgnoreNullExpr(exp: Expression): Boolean = exp match {
+    case Literal(b: Boolean, BooleanType) => b
+    case _ => throw new AnalysisException("The second argument should be a boolean literal.")


maybe we can pass the function name so that we can give a better error message.
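
One possible shape for that suggestion, as a self-contained sketch: the funcName parameter and the message wording are assumptions, and SketchExpr, BoolLit, Attr and IllegalArgumentException are stand-ins for the Catalyst types and AnalysisException.

```scala
object ValidateSketch {
  // Hypothetical stand-ins for Catalyst expressions.
  sealed trait SketchExpr
  final case class BoolLit(value: Boolean) extends SketchExpr
  final case class Attr(name: String) extends SketchExpr

  // Shared by FIRST and LAST; the caller passes its own name so the
  // error message says which function rejected the argument.
  def validateIgnoreNullExpr(exp: SketchExpr, funcName: String): Boolean = exp match {
    case BoolLit(b) => b
    case _ =>
      throw new IllegalArgumentException(
        s"The second argument in $funcName should be a boolean literal.")
  }
}
```

With this shape, First would call validateIgnoreNullExpr(expr, "first") and Last would call it with "last", so both can keep sharing one helper.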

viirya · 2020-07-17T15:49:04Z

I think the description is out of date. Can you also update the description?

SparkQA · 2020-07-17T17:06:14Z

Test build #126057 has finished for PR 29143 at commit 92f5b7d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-07-17T18:18:02Z

Looks like a related failure.

SparkQA · 2020-07-17T19:20:06Z

Test build #126056 has finished for PR 29143 at commit a83de82.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class First(child: Expression, ignoreNulls: Boolean)
case class Last(child: Expression, ignoreNulls: Boolean)

maropu · 2020-07-17T23:15:22Z

Ah, I see. It seems I need to update the golden files.

SparkQA · 2020-07-18T07:05:01Z

Test build #126086 has finished for PR 29143 at commit 66bf522.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-07-18T07:25:20Z

retest this please

SparkQA · 2020-07-18T12:20:14Z

Test build #126098 has finished for PR 29143 at commit 66bf522.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-07-19T02:11:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala

  override def toString: String = s"$prettyName($child)${if (ignoreNulls) " ignore nulls"}"
 }
+
+object FirstLast {


I think this deduplication is a little bit too much but I guess it's fine.

HyukjinKwon

LGTM

HyukjinKwon · 2020-07-19T02:12:08Z

Merged to master and branch-3.0.

…xpr in distinct aggregates
Closes #29143 from maropu/SPARK-32344.
Authored-by: Takeshi Yamamuro
Signed-off-by: HyukjinKwon
(cherry picked from commit c7a68a9)

maropu · 2020-07-20T01:54:29Z

@HyukjinKwon This should be backported to branch-2.4, too, for the 2.4.7 release?

HyukjinKwon · 2020-07-20T02:11:27Z

@maropu, sure feel free to port it back!

maropu · 2020-07-20T02:12:25Z

okay, I will. Thanks for the check, @HyukjinKwon

dongjoon-hyun · 2020-07-20T06:31:19Z

+1, late LGTM. Thank you all.

…ullsExpr in distinct aggregates
This is the backport of #29143.
Closes #29157 from maropu/SPARK-32344-BRANCH2.4.
Authored-by: Takeshi Yamamuro
Signed-off-by: Dongjoon Hyun

Fix

0e30a25

probot-autolabeler bot added the SQL label Jul 17, 2020

maropu mentioned this pull request Jul 17, 2020

[SPARK-31753][SQL][DOCS] Add missing keywords in the SQL docs #29056

Closed

geektcp approved these changes Jul 17, 2020

cloud-fan reviewed Jul 17, 2020

Fix

a83de82

cloud-fan reviewed Jul 17, 2020

cloud-fan approved these changes Jul 17, 2020

Fix

92f5b7d

viirya approved these changes Jul 17, 2020

Fix

66bf522

HyukjinKwon reviewed Jul 19, 2020

HyukjinKwon approved these changes Jul 19, 2020

HyukjinKwon closed this in c7a68a9 Jul 19, 2020

maropu mentioned this pull request Jul 20, 2020

[SPARK-32344][SQL][2.4] Unevaluable expr is set to FIRST/LAST ignoreNullsExpr in distinct aggregates #29157

Closed
