
Conversation

@gatorsmile
Member

Using GROUPING SETS generates a wrong result when an aggregate function references a GROUP BY column.

This PR fixes it. Since the code changes are very small, maybe we can also merge it into 1.6.

For example, the following query returns a wrong result:

sql("select course, sum(earnings) as sum from courseSales group by course, earnings" +
     " grouping sets((), (course), (course, earnings))" +
     " order by course, sum").show()

Before the fix, the results are like

[null,null]
[Java,null]
[Java,20000.0]
[Java,30000.0]
[dotNET,null]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]

After the fix, the results become correct:

[null,113000.0]
[Java,20000.0]
[Java,30000.0]
[Java,50000.0]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]
[dotNET,63000.0]
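For reference, GROUPING SETS is semantically a union of one aggregation per grouping set, with columns outside the set nulled out. The following plain-Scala sketch is Spark-independent and reproduces the corrected output; the five input rows are assumed from the results above, since the PR does not spell out the contents of courseSales.

```scala
// Plain-Scala sketch of GROUPING SETS semantics, independent of Spark.
// The input rows are assumed from the corrected output above.
object GroupingSetsSketch {
  val rows: Seq[(String, Double)] = Seq(
    ("Java", 20000.0), ("Java", 30000.0),
    ("dotNET", 5000.0), ("dotNET", 10000.0), ("dotNET", 48000.0))

  // One aggregation per grouping set; a column outside the set is
  // "nulled out", modeled here with Option/None.
  val perSet: Seq[(Option[String], Option[Double], Double)] =
    Seq(
      // grouping set (): a single global sum
      Seq((Option.empty[String], Option.empty[Double], rows.map(_._2).sum)),
      // grouping set (course)
      rows.groupBy(_._1).toSeq.map { case (c, rs) =>
        (Some(c), Option.empty[Double], rs.map(_._2).sum) },
      // grouping set (course, earnings)
      rows.groupBy(r => (r._1, r._2)).toSeq.map { case ((c, e), rs) =>
        (Some(c), Some(e), rs.map(_._2).sum) }
    ).flatten

  // ORDER BY course, sum (nulls sorted first, as in the expected output)
  val result: Seq[(Option[String], Option[Double], Double)] =
    perSet.sortBy { case (c, _, s) => (c.getOrElse(""), s) }
}
```

Running this yields the grand total 113000.0 first, then the per-course and per-(course, earnings) sums, matching the corrected results.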

UPDATE: This PR also deprecates the external column GROUPING__ID.

@SparkQA

SparkQA commented Feb 6, 2016

Test build #50855 has finished for PR 11100 at commit 2f9eeb9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 6, 2016

Test build #50869 has finished for PR 11100 at commit 18f4130.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

test this please

@SparkQA

SparkQA commented Feb 6, 2016

Test build #50872 has finished for PR 11100 at commit 18f4130.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 7, 2016

Test build #50892 has finished for PR 11100 at commit 18f4130.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@gatorsmile
Member Author

cc @yhuai @marmbrus @rxin Could you check whether this is an appropriate fix? After checking the history, I found you were involved in the discussion of the original fix for rollup and cube. : )

@liancheng This is blocking the JIRA https://issues.apache.org/jira/browse/SPARK-12720. I will submit a PR for it after this issue is addressed. Sorry for the delays. Thanks!

// Rewrite references to GROUPING__ID in the aggregate expressions to the
// gid virtual column, and append gid to the group-by list.
val aggExprs = g.aggregations.map(_.transform {
  case u: UnresolvedAttribute if resolver(u.name, VirtualColumn.groupingIdName) => gid
}.asInstanceOf[NamedExpression])
g.copy(aggregations = aggExprs, groupByExprs = g.groupByExprs :+ gid)
Member Author

This one has a potential bug. Will fix it soon.

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50907 has finished for PR 11100 at commit 18f4130.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50909 has finished for PR 11100 at commit 114e0eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50910 has finished for PR 11100 at commit 114e0eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50925 has finished for PR 11100 at commit 114e0eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51080 has finished for PR 11100 at commit 114e0eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@davies @hvanhovell @aray I just realized you recently reviewed a related PR: #10677. Could you also review this one? I checked that the results are still wrong in the latest code after #10677 was merged. Thanks!

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51087 has finished for PR 11100 at commit 524dfa0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray
Contributor

aray commented Feb 11, 2016

LGTM

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51100 has started for PR 11100 at commit e62c3d0.

@shaneknapp
Contributor

jenkins, test this please

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51111 has finished for PR 11100 at commit e62c3d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

else {
g
}
case x: GroupingSets =>
Contributor

Add if g.expressions.forall(_.resolved) to make sure that all the expressions are resolved.

@davies
Contributor

davies commented Feb 11, 2016

@gatorsmile Thanks for working on this.

The secret column name GROUPING__ID was introduced by Hive, but unfortunately Hive's implementation is wrong and we don't follow it (we corrected it). We can't be compatible with Hive anyway, so I'd rather not support it (dropping it is OK for 2.0); we can have an error message telling users to use grouping_id() instead.

The other bug could be fixed by resolving all the expressions before GroupingSets (a one-line change).
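For context, grouping_id() encodes which GROUP BY columns were aggregated away in each grouping set as a bitmask. A minimal sketch of that encoding, assuming the convention where a bit is 1 when the column is absent from the grouping set (the names here are illustrative, not Spark's internals):

```scala
object GroupingIdSketch {
  // Bit i (reading left to right over the GROUP BY list) is 1 when that
  // column is aggregated away, i.e. absent from the grouping set.
  def groupingId(groupByCols: Seq[String], set: Set[String]): Int =
    groupByCols.foldLeft(0) { (id, col) =>
      (id << 1) | (if (set.contains(col)) 0 else 1)
    }
}
```

For GROUP BY course, earnings this gives 0 for the set (course, earnings), 1 for (course), and 3 for the empty set ().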

@davies
Contributor

davies commented Feb 11, 2016

@gatorsmile I checked that your two tests pass with these two tiny changes.

@gatorsmile
Member Author

Thank you! @davies @aray

Yeah, my first fix was very similar to what you proposed above. I will keep in mind what you said regarding GROUPING__ID; after the 2.0 release, I will try to deprecate it and issue an error message.

BTW, I just tried the code changes and they work well in my local environment. Updated the code. Thanks!

@SparkQA

SparkQA commented Feb 12, 2016

Test build #51169 has finished for PR 11100 at commit 79c11de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Feb 12, 2016

@gatorsmile I think the first commit should be enough. Since 2.0 is the best chance to deprecate GROUPING__ID, we should do that BEFORE releasing 2.0.

@gatorsmile
Member Author

Sure, will revert the changes. Thank you!

@SparkQA

SparkQA commented Feb 12, 2016

Test build #51197 has finished for PR 11100 at commit ed518f9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2016

Test build #51229 has finished for PR 11100 at commit 7631371.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@davies I have deprecated GROUPING__ID in the latest commits and also cleaned up all the test cases.

I also tried to output a better error message when users manually specify GROUPING__ID in a query, but I hit an issue: to compare the column names, we need to pass conf or a resolver into the function checkAnalysis. That would touch multiple files that call this function, and I am not sure it is worth it for this purpose. Let me know if you want me to do it. Thanks!

  def resolver: Resolver = {
    if (conf.caseSensitiveAnalysis) {
      caseSensitiveResolution
    } else {
      caseInsensitiveResolution
    }
  }
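Stand-alone, the resolver choice boils down to two string predicates. A minimal sketch of the same logic (the type alias mirrors Spark's Resolver, which is (String, String) => Boolean; the rest is illustrative):

```scala
object ResolverSketch {
  type Resolver = (String, String) => Boolean

  // Case-sensitive and case-insensitive name comparison.
  val caseSensitiveResolution: Resolver = (a, b) => a == b
  val caseInsensitiveResolution: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Mirrors the def above: pick a resolver based on the analysis mode.
  def resolver(caseSensitiveAnalysis: Boolean): Resolver =
    if (caseSensitiveAnalysis) caseSensitiveResolution
    else caseInsensitiveResolution
}
```

With case-insensitive analysis (Spark SQL's default), GROUPING__ID and grouping__id resolve to the same name, which is why a resolver (and thus conf) is needed for the comparison.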

GroupingSets(bitmasks(r), groupByExprs, child, aggregateExpressions)
// Ensure all the expressions have been resolved.
case g: GroupingSets if g.expressions.exists(!_.resolved) => g
case x: GroupingSets =>
Contributor

It's clearer if you move the if into the next case.

Member Author

Ok, will do. Thanks!
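The guard discussed in this thread is a common analyzer idiom: a rule leaves an operator untouched until every expression in it is resolved, so earlier rules get a chance to resolve names first. A stripped-down sketch of the pattern, using a hypothetical mini-AST rather than Spark's actual classes:

```scala
object RuleGuardSketch {
  sealed trait Expr { def resolved: Boolean }
  case class UnresolvedAttr(name: String) extends Expr { val resolved = false }
  case class Attr(name: String) extends Expr { val resolved = true }

  // Toy stand-in for the GroupingSets operator; `expanded` marks that the
  // analytics rule has fired.
  case class GroupingSets(expressions: Seq[Expr], expanded: Boolean = false)

  // A hypothetical earlier rule that resolves attribute names.
  def resolveNames(g: GroupingSets): GroupingSets =
    g.copy(expressions = g.expressions.map {
      case UnresolvedAttr(n) => Attr(n)
      case e                 => e
    })

  // Mirrors the PR's guard: defer while anything is unresolved, then expand.
  def resolveGroupingAnalytics(g: GroupingSets): GroupingSets =
    if (g.expressions.exists(!_.resolved)) g   // defer to a later pass
    else g.copy(expanded = true)               // stand-in for the real expansion
}
```

Run against an operator with an unresolved attribute, the rule is a no-op; once a name-resolution pass has run, the rule fires.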

@davies
Contributor

davies commented Feb 14, 2016

@gatorsmile I think you can check for GROUPING__ID in ResolveGroupingAnalytics, and then raise an error.

@gatorsmile
Member Author

@davies As you suggested, the latest commit makes the following changes:

  • Replaced GROUPING__ID with grouping_id() in all the test cases and moved the expected results into the test cases
  • Moved all these test cases from hive/execution/HiveQuerySuite.scala to hive/execution/SQLQuerySuite.scala
  • Added code to the rule ResolveGroupingAnalytics to detect possible usage errors and issue an error message when necessary

Thanks!

@hvanhovell
Contributor

retest this please

@SparkQA

SparkQA commented Feb 14, 2016

Test build #51273 has finished for PR 11100 at commit 1ca66ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Whoops, I triggered the build unnecessarily.

@hvanhovell
Contributor

@gatorsmile do we still need VirtualColumn? Other than that LGTM.

@gatorsmile
Member Author

@hvanhovell Thank you for your reviews!

Although we deprecate GROUPING__ID, the new grouping_id() still requires VirtualColumn.

I like this change. Users are no longer allowed to select/query this hidden/secret column.

@SparkQA

SparkQA commented Feb 14, 2016

Test build #51275 has finished for PR 11100 at commit 1ca66ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Feb 16, 2016

LGTM. We keep the VirtualColumn to show a better error message. Merging this into master, thanks!

@asfgit asfgit closed this in fee739f Feb 16, 2016