
Conversation

@gatorsmile
Member

Using GROUPING SETS generates a wrong result when an aggregate function references a GROUP BY column.

This PR fixes it. Since the code changes are very small, maybe we can also merge it into 1.6.

For example, the following query returns a wrong result:

sql("select course, sum(earnings) as sum from courseSales group by course, earnings" +
     " grouping sets((), (course), (course, earnings))" +
     " order by course, sum").show()

Before the fix, the results are like

[null,null]
[Java,null]
[Java,20000.0]
[Java,30000.0]
[dotNET,null]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]

After the fix, the results become correct:

[null,113000.0]
[Java,20000.0]
[Java,30000.0]
[Java,50000.0]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]
[dotNET,63000.0]
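For reference, GROUPING SETS is semantically a union of one aggregation per grouping set, with columns outside the set nulled out. The following plain-Scala sketch is Spark-independent and reproduces the corrected output; the five input rows are assumed from the results above, since the PR does not spell out the contents of courseSales.

```scala
// Plain-Scala sketch of GROUPING SETS semantics, independent of Spark.
// The input rows are assumed from the corrected output above.
object GroupingSetsSketch {
  val rows: Seq[(String, Double)] = Seq(
    ("Java", 20000.0), ("Java", 30000.0),
    ("dotNET", 5000.0), ("dotNET", 10000.0), ("dotNET", 48000.0))

  // One aggregation per grouping set; a column outside the set is
  // "nulled out", modeled here with Option/None.
  val perSet: Seq[(Option[String], Option[Double], Double)] =
    Seq(
      // grouping set (): a single global sum
      Seq((Option.empty[String], Option.empty[Double], rows.map(_._2).sum)),
      // grouping set (course)
      rows.groupBy(_._1).toSeq.map { case (c, rs) =>
        (Some(c), Option.empty[Double], rs.map(_._2).sum) },
      // grouping set (course, earnings)
      rows.groupBy(r => (r._1, r._2)).toSeq.map { case ((c, e), rs) =>
        (Some(c), Some(e), rs.map(_._2).sum) }
    ).flatten

  // ORDER BY course, sum (nulls sorted first, as in the expected output)
  val result: Seq[(Option[String], Option[Double], Double)] =
    perSet.sortBy { case (c, _, s) => (c.getOrElse(""), s) }
}
```

Running this yields the grand total 113000.0 first, then the per-course and per-(course, earnings) sums, matching the corrected results.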

UPDATE: This PR also deprecates the external column GROUPING__ID.

@SparkQA

SparkQA commented Feb 6, 2016

Test build #50855 has finished for PR 11100 at commit 2f9eeb9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 6, 2016

Test build #50869 has finished for PR 11100 at commit 18f4130.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

test this please

@SparkQA

SparkQA commented Feb 6, 2016

Test build #50872 has finished for PR 11100 at commit 18f4130.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 7, 2016

Test build #50892 has finished for PR 11100 at commit 18f4130.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@gatorsmile
Member Author

cc @yhuai @marmbrus @rxin Could you check whether this is an appropriate fix? After checking the history, I found you were involved in the discussion of the original fix for rollup and cube. : )

@liancheng This is blocking the JIRA https://issues.apache.org/jira/browse/SPARK-12720. I will submit a PR for it after this issue is addressed. Sorry for the delays. Thanks!

// Rewrite references to GROUPING__ID in the aggregate expressions to the
// gid virtual column, and append gid to the group-by list.
val aggExprs = g.aggregations.map(_.transform {
  case u: UnresolvedAttribute if resolver(u.name, VirtualColumn.groupingIdName) => gid
}.asInstanceOf[NamedExpression])
g.copy(aggregations = aggExprs, groupByExprs = g.groupByExprs :+ gid)
Member Author

This one has a potential bug. Will fix it soon.

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50907 has finished for PR 11100 at commit 18f4130.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50909 has finished for PR 11100 at commit 114e0eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50910 has finished for PR 11100 at commit 114e0eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50925 has finished for PR 11100 at commit 114e0eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51080 has finished for PR 11100 at commit 114e0eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@davies @hvanhovell @aray I just realized you recently reviewed a related PR: #10677. Could you also review this one? I checked that the results are still wrong in the latest code after #10677 was merged. Thanks!

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51087 has finished for PR 11100 at commit 524dfa0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray
Contributor

aray commented Feb 11, 2016

LGTM

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51100 has started for PR 11100 at commit e62c3d0.

@shaneknapp
Contributor

jenkins, test this please

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51111 has finished for PR 11100 at commit e62c3d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

else {
g
}
case x: GroupingSets =>
Contributor

Add if g.expressions.forall(_.resolved) to make sure that all the expressions are resolved.

@davies
Contributor

davies commented Feb 11, 2016

@gatorsmile Thanks for working on this.

The secret column name GROUPING__ID was introduced by Hive, but unfortunately Hive's implementation is wrong and we don't follow it (we corrected it). We can't be compatible with Hive anyway, so I'd rather not support it (dropping it is OK for 2.0); we can have an error message telling users to use grouping_id() instead.

The other bug could be fixed by resolving all the expressions before GroupingSets (a one-line change).
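For context, grouping_id() encodes which GROUP BY columns were aggregated away in each grouping set as a bitmask. A minimal sketch of that encoding, assuming the convention where a bit is 1 when the column is absent from the grouping set (the names here are illustrative, not Spark's internals):

```scala
object GroupingIdSketch {
  // Bit i (reading left to right over the GROUP BY list) is 1 when that
  // column is aggregated away, i.e. absent from the grouping set.
  def groupingId(groupByCols: Seq[String], set: Set[String]): Int =
    groupByCols.foldLeft(0) { (id, col) =>
      (id << 1) | (if (set.contains(col)) 0 else 1)
    }
}
```

For GROUP BY course, earnings this gives 0 for the set (course, earnings), 1 for (course), and 3 for the empty set ().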

@davies
Contributor

davies commented Feb 11, 2016

@gatorsmile I checked that your two tests pass with these two tiny changes.

@gatorsmile
Member Author

Thank you! @davies @aray

Yeah, my first fix was very similar to what you proposed above. I will keep in mind what you said regarding GROUPING__ID; after the 2.0 release, I will try to deprecate it and issue an error message.

BTW, I just tried the code changes and they work well in my local environment. Updated the code. Thanks!

@SparkQA

SparkQA commented Feb 12, 2016

Test build #51169 has finished for PR 11100 at commit 79c11de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Feb 12, 2016

@gatorsmile I think the first commit should be enough. Since 2.0 is the best chance to deprecate GROUPING__ID, we should do that BEFORE releasing 2.0.

@gatorsmile
Member Author

Sure, will revert the changes. Thank you!

@SparkQA

SparkQA commented Feb 12, 2016

Test build #51197 has finished for PR 11100 at commit ed518f9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2016

Test build #51229 has finished for PR 11100 at commit 7631371.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@davies I have deprecated GROUPING__ID in the latest commits and also cleaned up all the test cases.

I also tried to output a better error message when users manually specify GROUPING__ID in a query, but I hit an issue: to compare the column names, we need to pass conf or a resolver into the function checkAnalysis. That would touch multiple files that call this function, and I am not sure it is worth it for this purpose. Let me know if you want me to do it. Thanks!

  def resolver: Resolver = {
    if (conf.caseSensitiveAnalysis) {
      caseSensitiveResolution
    } else {
      caseInsensitiveResolution
    }
  }
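Stand-alone, the resolver choice boils down to two string predicates. A minimal sketch of the same logic (the type alias mirrors Spark's Resolver, which is (String, String) => Boolean; the rest is illustrative):

```scala
object ResolverSketch {
  type Resolver = (String, String) => Boolean

  // Case-sensitive and case-insensitive name comparison.
  val caseSensitiveResolution: Resolver = (a, b) => a == b
  val caseInsensitiveResolution: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Mirrors the def above: pick a resolver based on the analysis mode.
  def resolver(caseSensitiveAnalysis: Boolean): Resolver =
    if (caseSensitiveAnalysis) caseSensitiveResolution
    else caseInsensitiveResolution
}
```

With case-insensitive analysis (Spark SQL's default), GROUPING__ID and grouping__id resolve to the same name, which is why a resolver (and thus conf) is needed for the comparison.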

GroupingSets(bitmasks(r), groupByExprs, child, aggregateExpressions)
// Ensure all the expressions have been resolved.
case g: GroupingSets if g.expressions.exists(!_.resolved) => g
case x: GroupingSets =>
Contributor

It's clearer if you move the if into the next case.

Member Author

Ok, will do. Thanks!
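The guard discussed in this thread is a common analyzer idiom: a rule leaves an operator untouched until every expression in it is resolved, so earlier rules get a chance to resolve names first. A stripped-down sketch of the pattern, using a hypothetical mini-AST rather than Spark's actual classes:

```scala
object RuleGuardSketch {
  sealed trait Expr { def resolved: Boolean }
  case class UnresolvedAttr(name: String) extends Expr { val resolved = false }
  case class Attr(name: String) extends Expr { val resolved = true }

  // Toy stand-in for the GroupingSets operator; `expanded` marks that the
  // analytics rule has fired.
  case class GroupingSets(expressions: Seq[Expr], expanded: Boolean = false)

  // A hypothetical earlier rule that resolves attribute names.
  def resolveNames(g: GroupingSets): GroupingSets =
    g.copy(expressions = g.expressions.map {
      case UnresolvedAttr(n) => Attr(n)
      case e                 => e
    })

  // Mirrors the PR's guard: defer while anything is unresolved, then expand.
  def resolveGroupingAnalytics(g: GroupingSets): GroupingSets =
    if (g.expressions.exists(!_.resolved)) g   // defer to a later pass
    else g.copy(expanded = true)               // stand-in for the real expansion
}
```

Run against an operator with an unresolved attribute, the rule is a no-op; once a name-resolution pass has run, the rule fires.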

@davies
Contributor

davies commented Feb 14, 2016

@gatorsmile I think you can check for GROUPING__ID in ResolveGroupingAnalytics, and then raise an error.

@gatorsmile
Member Author

@davies As you suggested, the latest commit makes the following changes:

  • Replaced GROUPING__ID with grouping_id() in all the test cases and moved the expected results into the test cases
  • Moved all these test cases from hive/execution/HiveQuerySuite.scala to hive/execution/SQLQuerySuite.scala
  • Added code to the rule ResolveGroupingAnalytics to detect possible usage errors and issue an error message when necessary

Thanks!

@hvanhovell
Contributor

retest this please

@SparkQA

SparkQA commented Feb 14, 2016

Test build #51273 has finished for PR 11100 at commit 1ca66ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Whoops, I triggered the build unnecessarily.

@hvanhovell
Contributor

@gatorsmile do we still need VirtualColumn? Other than that LGTM.

@gatorsmile
Member Author

@hvanhovell Thank you for your reviews!

Although we deprecate GROUPING__ID, the new grouping_id() still requires VirtualColumn.

I like this change. Users are no longer allowed to select/query this hidden/secret column.

@SparkQA

SparkQA commented Feb 14, 2016

Test build #51275 has finished for PR 11100 at commit 1ca66ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Feb 16, 2016

LGTM. We keep the VirtualColumn to show a better error message. Merging this into master, thanks!

@asfgit asfgit closed this in fee739f Feb 16, 2016