-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-32344][SQL] Unevaluable expr is set to FIRST/LAST ignoreNullsExpr in distinct aggregates #29143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
geektcp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
|
Test build #126038 has finished for PR 29143 at commit
|
| // [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is disallowed because those two distinct | ||
| // aggregates have different column expressions. | ||
| val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children | ||
| .filterNot(_.foldable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried select count(distinct 1) from v group by id but works fine, I think foldable is not the real problem here. I think the root cause is, FIRST/LAST put ignoreNulls as its children while the children are supposed to be the function inputs.
How about this fix
case class First(child: Expression, ignoreNulls: Boolean)
extends DeclarativeAggregate with ExpectsInputTypes {
def this(child: Expression) = this(child, false)
def this(child: Expression, ignoreNullsExpr: Expression) = {
this(child, First.validateIgnoreNullExpr(ignoreNullsExpr)) // follow HyperLogLogPlusPlus.validateDoubleLiteral
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, looks okay. I'll update.
| object FirstLast { | ||
| def validateIgnoreNullExpr(exp: Expression): Boolean = exp match { | ||
| case Literal(b: Boolean, BooleanType) => b | ||
| case _ => throw new AnalysisException("The second argument should be a boolean literal.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can pass the function name so that we can give a better error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
I think the description is out-of-dated. Can you also update the description? |
|
Test build #126057 has finished for PR 29143 at commit
|
|
Looks like related failure. |
|
Test build #126056 has finished for PR 29143 at commit
|
|
Ah, I see. Seems like I need to update the golden files.. |
|
Test build #126086 has finished for PR 29143 at commit
|
|
retest this please |
|
Test build #126098 has finished for PR 29143 at commit
|
| override def toString: String = s"$prettyName($child)${if (ignoreNulls) " ignore nulls"}" | ||
| } | ||
|
|
||
| object FirstLast { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this deduplication is a little bit too much but I guess it's fine.
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Merged to master and branch-3.0. |
…xpr in distinct aggregates
### What changes were proposed in this pull request?
This PR intends to fix a bug of distinct FIRST/LAST aggregates in v2.4.6/v3.0.0/master;
```
scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show()
...
Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: false#37
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226)
at org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68)
at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82)
at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268)
```
A root cause of this bug is that the `Aggregation` strategy replaces a foldable boolean `ignoreNullsExpr` expr with a `Unevaluable` expr (`AttributeReference`) for distinct FIRST/LAST aggregate functions. But, this operation cannot be allowed because the `Analyzer` has checked that it must be foldabe;
https://github.com/apache/spark/blob/ffdbbae1d465fe2c710d020de62ca1a6b0b924d9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala#L74-L76
So, this PR proposes to change a vriable for `IGNORE NULLS` from `Expression` to `Boolean` to avoid the case.
### Why are the changes needed?
Bugfix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a test in `DataFrameAggregateSuite`.
Closes #29143 from maropu/SPARK-32344.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit c7a68a9)
Signed-off-by: HyukjinKwon <[email protected]>
|
@HyukjinKwon This shoulbe be backported to branch-2.4, too, for the 2.4.7 release? |
|
@maropu, sure feel free to port it back! |
|
okay, I will. Thanks for the check, @HyukjinKwon |
|
+1, late LGTM. Thank you all. |
…ullsExpr in distinct aggregates
### What changes were proposed in this pull request?
This PR intends to fix a bug of distinct FIRST/LAST aggregates in v2.4.6;
```
scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show()
...
Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: false#37
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226)
at org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68)
at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82)
at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268)
```
A root cause of this bug is that the `Aggregation` strategy replaces a foldable boolean `ignoreNullsExpr` expr with a `Unevaluable` expr (`AttributeReference`) for distinct FIRST/LAST aggregate functions. But, this operation cannot be allowed because the `Analyzer` has checked that it must be foldabe;
https://github.com/apache/spark/blob/ffdbbae1d465fe2c710d020de62ca1a6b0b924d9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala#L74-L76
So, this PR proposes to change a vriable for `IGNORE NULLS` from `Expression` to `Boolean` to avoid the case.
This is the backport of #29143.
### Why are the changes needed?
Bugfix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a test in `DataFrameAggregateSuite`.
Closes #29157 from maropu/SPARK-32344-BRANCH2.4.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This PR intends to fix a bug of distinct FIRST/LAST aggregates in v2.4.6/v3.0.0/master;
A root cause of this bug is that the
Aggregationstrategy replaces a foldable booleanignoreNullsExprexpr with aUnevaluableexpr (AttributeReference) for distinct FIRST/LAST aggregate functions. But, this operation cannot be allowed because theAnalyzerhas checked that it must be foldabe;spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
Lines 74 to 76 in ffdbbae
So, this PR proposes to change a vriable for
IGNORE NULLSfromExpressiontoBooleanto avoid the case.Why are the changes needed?
Bugfix.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a test in
DataFrameAggregateSuite.