[SPARK-43838][SQL] Fix subquery on single table with having clause can't be optimized #41347
Conversation
I think a better fix is to let
@cloud-fan Hi, I already implemented
cc @jchen5
I'm reading the doc of the `collectConflictPlans` function in this class. I think the problem we have now is that there are more plan nodes than just the leaf nodes that can produce new attributes, and we need to handle all of them. `Project` is not the only one; let's follow the plan nodes that `collectConflictPlans` handles.
I added all the plan nodes that `collectConflictPlans` handles. Please check again. There is another problem, though: how to test it — producing a negative case does not seem easy.
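The idea discussed above can be sketched with a toy plan tree (hypothetical node names and a simplified structure, not Spark's actual Catalyst classes): deduplication must visit every node type that can introduce fresh output attributes, not only `Project`, mirroring the role of `collectConflictPlans`.

```scala
// Toy plan ADT (hypothetical, for illustration only). In Catalyst, several
// node types besides Project can emit new output attributes.
sealed trait Plan { def producesNewAttributes: Boolean }
case class LeafRelation() extends Plan { val producesNewAttributes = true }
case class Project(child: Plan) extends Plan { val producesNewAttributes = true }
case class Aggregate(child: Plan) extends Plan { val producesNewAttributes = true }
case class Filter(child: Plan) extends Plan { val producesNewAttributes = false }

// Collect every node that can emit fresh attributes, top-down, mirroring
// what collectConflictPlans must account for in DeduplicateRelations.
def collectAttributeProducers(p: Plan): Seq[Plan] = {
  val fromChildren = p match {
    case Project(c)     => collectAttributeProducers(c)
    case Aggregate(c)   => collectAttributeProducers(c)
    case Filter(c)      => collectAttributeProducers(c)
    case LeafRelation() => Nil
  }
  if (p.producesNewAttributes) p +: fromChildren else fromChildren
}

val plan = Filter(Aggregate(Project(LeafRelation())))
// Project, Aggregate, and the leaf all produce attributes; Filter does not.
assert(collectAttributeProducers(plan).length == 3)
```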
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DeduplicateRelations.scala
Force-pushed from 3a29635 to 075ef46.
Force-pushed from 075ef46 to 0c12ef9.
```scala
Seq(LeftSemi, LeftAnti).foreach { case jt =>
```
This test is unnecessary, because we can already deduplicate those attributes when an anti-join / semi-join is a self-join. Please refer to #39131.
```scala
  extends SubqueryExpression(plan, outerAttrs, exprId, joinCond, hint) with Unevaluable {
  override def dataType: DataType = {
    assert(plan.schema.fields.nonEmpty, "Scalar subquery should have only one column")
    if (!plan.schema.fields.nonEmpty) {
```
Make sure an `AnalysisException` will be thrown, not an `AssertionError`.
Can we really reach this code branch?
Yes. Usually this error is thrown by checkAnalysis, but we may call `dataType` in DeduplicateRelations, which causes this exception to be thrown there instead. This change ensures that the thrown exception is consistent.
Change before:
```
Caused by: sbt.ForkMain$ForkError: java.lang.AssertionError: assertion failed: Scalar subquery should have only one column
	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.dataType(subquery.scala:274)
	at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:530)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:530)
	at scala.PartialFunction.$anonfun$runWith$1$adapted(PartialFunction.scala:145)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.collect(TraversableLike.scala:407)
	at scala.collection.TraversableLike.collect$(TraversableLike.scala:405)
	at scala.collection.AbstractTraversable.collect(Traversable.scala:108)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.findAliases(DeduplicateRelations.scala:530)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.org$apache$spark$sql$catalyst$analysis$DeduplicateRelations$$renewDuplicatedRelations(DeduplicateRelations.scala:120)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:40)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:38)
```
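A minimal, self-contained sketch of the exception-consistency idea discussed in this thread (stand-in classes, not Spark's actual `ScalarSubquery` or `AnalysisException`): replacing the `assert` with an explicit exception means every caller, including DeduplicateRelations, sees the same exception type instead of a raw `AssertionError`.

```scala
// Stand-in for Spark's AnalysisException (hypothetical simplification).
class AnalysisException(msg: String) extends RuntimeException(msg)

case class Schema(fields: Seq[String])

case class ScalarSubquery(schema: Schema) {
  // Before: assert(schema.fields.nonEmpty, ...) raised an AssertionError.
  // After: throw AnalysisException, matching what checkAnalysis reports.
  def dataType: String = {
    if (schema.fields.isEmpty) {
      throw new AnalysisException("Scalar subquery should have only one column")
    }
    schema.fields.head
  }
}

// The happy path still works; the empty-schema path now throws the
// consistent exception type rather than an AssertionError.
assert(ScalarSubquery(Schema(Seq("int"))).dataType == "int")
val consistent =
  try { ScalarSubquery(Schema(Nil)).dataType; false }
  catch { case _: AnalysisException => true }
assert(consistent)
```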
The k8s failure is unrelated, I'm merging it to master, thanks!
Thanks @cloud-fan for your help, and @allisonwang-db
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DeduplicateRelations.scala
…dRelations`

### What changes were proposed in this pull request?
This is a follow-up PR for #41347, adding the missing aggregate case in `renewDuplicatedRelations`.

### Why are the changes needed?
Add the missing case.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #42160 from Hisoka-X/SPARK-43838_subquery_aggregate_follow_up.
Authored-by: Jia Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Eg:
The error that will be thrown:
The reason for the error is an unresolved expression in the `Join` node generated by subquery decorrelation. `duplicateResolved` in the `Join` node is false, meaning the `Join`'s left and right sides have the same `Attribute`, in this example `c1#224`. The right-side `c1#224` `Attribute` is generated by the having inputs, because the having inputs are wrong. This problem only occurs when the query contains a having clause.
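The `duplicateResolved` condition described above can be sketched with a toy model (hypothetical `Attribute`/`Relation` classes and a made-up `deduplicate` helper, not Spark's actual classes): a join is only resolved when its two sides share no attribute ids, and DeduplicateRelations repairs a conflict by re-aliasing one side with fresh expression ids.

```scala
// Toy model of Catalyst attributes identified by an expression id,
// written as name#exprId (e.g. c1#224) in plan output.
case class Attribute(name: String, exprId: Long)
case class Relation(output: Seq[Attribute])

// Mirrors the Join.duplicateResolved condition: no shared expression ids
// between the left and right sides.
def duplicateResolved(left: Relation, right: Relation): Boolean =
  left.output.map(_.exprId).toSet
    .intersect(right.output.map(_.exprId).toSet).isEmpty

// Re-alias conflicting right-side attributes with fresh expression ids,
// which is the kind of repair DeduplicateRelations performs for self-joins.
def deduplicate(left: Relation, right: Relation, nextId: Long): Relation =
  Relation(right.output.zipWithIndex.map { case (a, i) =>
    if (left.output.exists(_.exprId == a.exprId)) a.copy(exprId = nextId + i)
    else a
  })

val left  = Relation(Seq(Attribute("c1", 224)))
val right = Relation(Seq(Attribute("c1", 224))) // same c1#224 on both sides
assert(!duplicateResolved(left, right))         // the buggy, unresolved state
assert(duplicateResolved(left, deduplicate(left, right, 300)))
```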
This PR also includes some code-format fixes.
Why are the changes needed?
Fix a subquery bug on a single table when using a having clause.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a new test.