Skip to content

Conversation

@maropu
Copy link
Member

@maropu maropu commented Jul 17, 2020

What changes were proposed in this pull request?

This PR intends to fix a bug of distinct FIRST/LAST aggregates in v2.4.6/v3.0.0/master;

scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show()
...
Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: false#37
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
  at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268)

A root cause of this bug is that the Aggregation strategy replaces a foldable boolean ignoreNullsExpr expr with a Unevaluable expr (AttributeReference) for distinct FIRST/LAST aggregate functions. But, this operation cannot be allowed because the Analyzer has checked that it must be foldabe;

} else if (!ignoreNullsExpr.foldable) {
TypeCheckFailure(
s"The second argument of First must be a boolean literal, but got: ${ignoreNullsExpr.sql}")

So, this PR proposes to change a vriable for IGNORE NULLS from Expression to Boolean to avoid the case.

Why are the changes needed?

Bugfix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a test in DataFrameAggregateSuite.

Copy link

@geektcp geektcp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@SparkQA
Copy link

SparkQA commented Jul 17, 2020

Test build #126038 has finished for PR 29143 at commit 0e30a25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jul 17, 2020

cc: @cloud-fan @viirya @dongjoon-hyun

// [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is disallowed because those two distinct
// aggregates have different column expressions.
val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
.filterNot(_.foldable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried select count(distinct 1) from v group by id but works fine, I think foldable is not the real problem here. I think the root cause is, FIRST/LAST put ignoreNulls as its children while the children are supposed to be the function inputs.

How about this fix

case class First(child: Expression, ignoreNulls: Boolean)
  extends DeclarativeAggregate with ExpectsInputTypes {

  def this(child: Expression) = this(child, false)

  def this(child: Expression, ignoreNullsExpr: Expression) = {
    this(child, First.validateIgnoreNullExpr(ignoreNullsExpr)) // follow HyperLogLogPlusPlus.validateDoubleLiteral
  }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, looks okay. I'll update.

object FirstLast {
def validateIgnoreNullExpr(exp: Expression): Boolean = exp match {
case Literal(b: Boolean, BooleanType) => b
case _ => throw new AnalysisException("The second argument should be a boolean literal.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can pass the function name so that we can give a better error message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@viirya
Copy link
Member

viirya commented Jul 17, 2020

I think the description is out-of-dated. Can you also update the description?

@SparkQA
Copy link

SparkQA commented Jul 17, 2020

Test build #126057 has finished for PR 29143 at commit 92f5b7d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Jul 17, 2020

Looks like related failure.

@SparkQA
Copy link

SparkQA commented Jul 17, 2020

Test build #126056 has finished for PR 29143 at commit a83de82.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class First(child: Expression, ignoreNulls: Boolean)
  • case class Last(child: Expression, ignoreNulls: Boolean)

@maropu
Copy link
Member Author

maropu commented Jul 17, 2020

Ah, I see. Seems like I need to update the golden files..

@SparkQA
Copy link

SparkQA commented Jul 18, 2020

Test build #126086 has finished for PR 29143 at commit 66bf522.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jul 18, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Jul 18, 2020

Test build #126098 has finished for PR 29143 at commit 66bf522.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

override def toString: String = s"$prettyName($child)${if (ignoreNulls) " ignore nulls"}"
}

object FirstLast {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deduplication is a little bit too much but I guess it's fine.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HyukjinKwon
Copy link
Member

Merged to master and branch-3.0.

HyukjinKwon pushed a commit that referenced this pull request Jul 19, 2020
…xpr in distinct aggregates

### What changes were proposed in this pull request?

This PR intends to fix a bug of distinct FIRST/LAST aggregates in v2.4.6/v3.0.0/master;
```
scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show()
...
Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: false#37
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
  at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268)
```
A root cause of this bug is that the `Aggregation` strategy replaces a foldable boolean `ignoreNullsExpr` expr with a `Unevaluable` expr (`AttributeReference`) for distinct FIRST/LAST aggregate functions. But, this operation cannot be allowed because the `Analyzer` has checked that it must be foldabe;
https://github.com/apache/spark/blob/ffdbbae1d465fe2c710d020de62ca1a6b0b924d9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala#L74-L76
So, this PR proposes to change a vriable for `IGNORE NULLS`  from `Expression` to `Boolean` to avoid the case.

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a test in `DataFrameAggregateSuite`.

Closes #29143 from maropu/SPARK-32344.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit c7a68a9)
Signed-off-by: HyukjinKwon <[email protected]>
@maropu
Copy link
Member Author

maropu commented Jul 20, 2020

@HyukjinKwon This shoulbe be backported to branch-2.4, too, for the 2.4.7 release?

@HyukjinKwon
Copy link
Member

@maropu, sure feel free to port it back!

@maropu
Copy link
Member Author

maropu commented Jul 20, 2020

okay, I will. Thanks for the check, @HyukjinKwon

@dongjoon-hyun
Copy link
Member

+1, late LGTM. Thank you all.

dongjoon-hyun pushed a commit that referenced this pull request Jul 20, 2020
…ullsExpr in distinct aggregates

### What changes were proposed in this pull request?

This PR intends to fix a bug of distinct FIRST/LAST aggregates in v2.4.6;
```
scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show()
...
Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: false#37
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
  at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268)
```
A root cause of this bug is that the `Aggregation` strategy replaces a foldable boolean `ignoreNullsExpr` expr with a `Unevaluable` expr (`AttributeReference`) for distinct FIRST/LAST aggregate functions. But, this operation cannot be allowed because the `Analyzer` has checked that it must be foldabe;
https://github.com/apache/spark/blob/ffdbbae1d465fe2c710d020de62ca1a6b0b924d9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala#L74-L76
So, this PR proposes to change a vriable for `IGNORE NULLS`  from `Expression` to `Boolean` to avoid the case.

This is the backport of #29143.

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a test in `DataFrameAggregateSuite`.

Closes #29157 from maropu/SPARK-32344-BRANCH2.4.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants