[SPARK-21966][SQL]ResolveMissingReference rule should not ignore Union #19178
Conversation
Can one of the admins verify this patch?
Could you trigger the tests? I think this needs to be fixed. @gatorsmile @HyukjinKwon
case d: Distinct =>
  throw new AnalysisException(s"Can't add $missingAttrs to $d")
case u: Union =>
  u.withNewChildren(u.children.map(addMissingAttr(_, missingAttrs)))
This is not the only issue: I think binary operators have the same problem, e.g.,
scala> df3.join(df4).filter("grouping_id()=0").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`spark_grouping_id`' given input columns: [a, sum(b), a, sum(b)];;
'Filter ('spark_grouping_id = 0)
+- Join Inner
:- Aggregate [a#27, spark_grouping_id#25], [a#27, sum(cast(b#6 as bigint)) AS sum(b)#24L]
: +- Expand [List(a#5, b#6, a#26, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#27, spark_grouping_id#25]
: +- Project [a#5, b#6, a#5 AS a#26]
: +- Project [_1#0 AS a#5, _2#1 AS b#6]
: +- LocalRelation [_1#0, _2#1]
+- Aggregate [a#38, spark_grouping_id#36], [a#38, sum(cast(b#16 as bigint)) AS sum(b)#35L]
+- Expand [List(a#15, b#16, a#37, 0), List(a#15, b#16, null, 1)], [a#15, b#16, a#38, spark_grouping_id#36]
+- Project [a#15, b#16, a#15 AS a#37]
+- Project [_1#10 AS a#15, _2#11 AS b#16]
+- LocalRelation [_1#10, _2#11]
So we need a more general solution for this case, I think.
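(For reference, the plan above can be reproduced end to end with the DataFrames from the PR description below; a sketch for spark-shell:)

val df1 = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
val df2 = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 3))).toDF("a", "b")
val df3 = df1.cube("a").sum("b")
val df4 = df2.cube("a").sum("b")

// grouping_id() resolves to spark_grouping_id, which exists on both join
// sides with different expression IDs, so a per-child fix is not enough here.
df3.join(df4).filter("grouping_id()=0").show()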
Yeah, I agree with you. The current implementation only handles UnaryNode; it needs to take all node types into consideration.
Thanks for the suggestion, I will work on a general solution.
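As a first cut, the naive generalization would just recurse into every child the way the new Union case does; a sketch only, not the PR's code:

case n if n.children.nonEmpty =>
  n.withNewChildren(n.children.map(addMissingAttr(_, missingAttrs)))

But as the Join plan above shows, the missing attribute can appear on both sides with different expression IDs, so a correct general fix would also need to decide which child should supply each attribute.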
gatorsmile left a comment
The rule ResolveMissingReference was mainly added for sort, since traditional RDBMS systems support it.
When we wrote this rule, we did not plan to support binary nodes; complete support could make the rule very complex. We still prefer that users add the missing references in the query themselves instead of having Spark SQL add them.
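Concretely, that user-side workaround is to materialize the grouping ID as a regular column inside each aggregate, so the later filter no longer needs a missing reference. A minimal sketch against the repro in the PR description below (the column name gid is made up):

import org.apache.spark.sql.functions.{grouping_id, sum}

val df3 = df1.cube("a").agg(grouping_id().as("gid"), sum("b"))
val df4 = df2.cube("a").agg(grouping_id().as("gid"), sum("b"))
df3.union(df4).filter("gid = 0").show()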
Aha, you mean this is an expected design and we won't change this logic in the future?
Yes. We do not plan to introduce extra complexity for improving this rule.
yea, thanks!
Thanks, I will give it a try in our private repository, as there are several such cases and the users want to migrate in a seamless way. But I found that general support is really complicated.
yea, also could you close the jira ticket as Won't Fix?
What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-21966
The problem can be reproduced by the following example.
val df1 = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
val df2 = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 3))).toDF("a", "b")
val df3 = df1.cube("a").sum("b")
val df4 = df2.cube("a").sum("b")
val df5 = df3.union(df4).filter("grouping_id()=0").show()

The org.apache.spark.sql.AnalysisException: cannot resolve 'spark_grouping_id' given input columns is thrown because the ResolveMissingReference rule ignores the Union operator. This PR fixes the issue.
How was this patch tested?
unit tests
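A hypothetical shape for such a test, assuming a suite where checkAnswer and the toDF implicits are available (a sketch, not necessarily the test added by this PR):

test("SPARK-21966: resolve missing references through Union") {
  val df1 = Seq((1, 1), (2, 1), (2, 2)).toDF("a", "b")
  val df2 = Seq((1, 1), (1, 2), (2, 3)).toDF("a", "b")
  val union = df1.cube("a").sum("b").union(df2.cube("a").sum("b"))
  // Before the fix this threw: cannot resolve 'spark_grouping_id'.
  checkAnswer(
    union.filter("grouping_id() = 0"),
    Row(1, 1L) :: Row(2, 3L) :: Row(1, 3L) :: Row(2, 3L) :: Nil)
}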