[SPARK-18969][SQL] Support grouping by nondeterministic expressions - WIP #16379

rxin · 2016-12-22T01:35:24Z

What changes were proposed in this pull request?

WIP

How was this patch tested?

Added a new test suite PullOutNondeterministicSuite to cover the new case and existing cases.

rxin · 2016-12-22T01:35:35Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/analysis/PullOutNondeterministicSuite.scala

+  test("aggregate") {
+    checkAnalysis(
+      r.groupBy(rnd)(rnd),
+      r.select(a, b, rnd).groupBy(rndref)(rndref)


@cloud-fan any idea why this test case fails?

failure

[info] - aggregate *** FAILED *** (22 milliseconds)
[info]   == FAIL: Plans do not match ===
[info]   !Aggregate [_nondeterministic#0], [_nondeterministic#0 AS _nondeterministic#0]   Aggregate [_nondeterministic#0], [_nondeterministic#0]
[info]    +- Project [a#0, b#0, rand(10) AS _nondeterministic#0]                          +- Project [a#0, b#0, rand(10) AS _nondeterministic#0]
[info]       +- LocalRelation <empty>, [a#0, b#0]                                            +- LocalRelation <empty>, [a#0, b#0] (PlanTest.scala:95)

rxin · 2016-12-22T01:35:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

-          val leafNondeterministic = expr.collect {
-            case n: Nondeterministic => n
-          }
+      case p: UnaryNode if p.expressions.exists(!_.deterministic) =>


Note that we might want to whitelist operators rather than looking at all the unary nodes ...

It looks like the ones that are interesting are:

Pivot

Window

Aggregate

RedistributeData (RepartitionByExpression and SortPartitions)

Sort
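To make the whitelisting idea concrete, here is a minimal sketch on a toy plan model. None of these types are Catalyst classes: `Ex`, `Node`, `OtherUnary`, and `needsPullOut` are hypothetical names used only for illustration, and only two of the operators listed above are modeled.

```scala
// Toy stand-ins for expressions and plan nodes; these are NOT Spark classes.
sealed trait Ex { def deterministic: Boolean }
case class Column(name: String) extends Ex { val deterministic = true }
case class RandEx(seed: Long) extends Ex { val deterministic = false }

sealed trait Node { def expressions: Seq[Ex] }
case class Leaf(cols: Seq[Column]) extends Node {
  def expressions: Seq[Ex] = cols
}
case class SortNode(order: Seq[Ex], child: Node) extends Node {
  def expressions: Seq[Ex] = order
}
case class AggregateNode(grouping: Seq[Ex], child: Node) extends Node {
  def expressions: Seq[Ex] = grouping
}
// A hypothetical unary operator that is deliberately left off the whitelist.
case class OtherUnary(exprs: Seq[Ex], child: Node) extends Node {
  def expressions: Seq[Ex] = exprs
}

object Whitelist {
  // Only whitelisted operators are candidates for the rewrite, instead of
  // every unary node that happens to carry a nondeterministic expression.
  def needsPullOut(plan: Node): Boolean = plan match {
    case _: AggregateNode | _: SortNode =>
      plan.expressions.exists(!_.deterministic)
    case _ => false
  }
}
```

With this shape, adding another operator to the rule is a one-line change to the pattern, and operators that are not listed are left untouched even when they contain a nondeterministic expression.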

SparkQA · 2016-12-22T01:39:34Z

Test build #70499 has finished for PR 16379 at commit ac81d95.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

## What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in `Aggregate` (see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`. This PR fixes it.

close #16379

There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the two `rand()` calls are different (we generate a random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

## How was this patch tested?

a new test suite

Author: Wenchen Fan

Closes #16404 from cloud-fan/groupby.

(cherry picked from commit 871d266)
Signed-off-by: Wenchen Fan
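The rewrite and the remaining `rand()` issue can be illustrated with a small, self-contained sketch. It is a toy model, not the actual `PullOutNondeterministic` implementation: every type below, the `pullOut` helper, and the `_nondeterministic<i>` naming are assumptions made for illustration.

```scala
// Toy expression and plan model; these are NOT Catalyst classes.
sealed trait Expr
case class Col(name: String) extends Expr
case class Rand(seed: Long) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Alias(child: Expr, name: String) extends Expr

sealed trait Plan
case class Relation(output: Seq[Col]) extends Plan
case class Project(list: Seq[Expr], child: Plan) extends Plan
case class Aggregate(grouping: Seq[Expr], output: Seq[Expr], child: Plan) extends Plan

object PullOutSketch {
  // Collect the leaf nondeterministic expressions (only Rand in this model).
  def leaves(e: Expr): Seq[Expr] = e match {
    case r: Rand     => Seq(r)
    case Add(l, r)   => leaves(l) ++ leaves(r)
    case Alias(c, _) => leaves(c)
    case _: Col      => Nil
  }

  // Replace each pulled-out expression with a reference to its new column.
  def substitute(e: Expr, m: Map[Expr, Col]): Expr = m.get(e) match {
    case Some(ref) => ref
    case None => e match {
      case Add(l, r)   => Add(substitute(l, m), substitute(r, m))
      case Alias(c, n) => Alias(substitute(c, m), n)
      case other       => other
    }
  }

  // Evaluate nondeterministic leaves once in a Project below the Aggregate,
  // then group by (and output) references to the projected columns.
  def pullOut(agg: Aggregate): Aggregate = agg.child match {
    case rel: Relation =>
      val nondet = (agg.grouping ++ agg.output).flatMap(leaves).distinct
      val mapping: Map[Expr, Col] = nondet.zipWithIndex.map {
        case (e, i) => e -> Col(s"_nondeterministic$i")
      }.toMap
      val projected: Seq[Expr] =
        rel.output ++ nondet.map(e => Alias(e, mapping(e).name))
      Aggregate(
        agg.grouping.map(substitute(_, mapping)),
        agg.output.map(substitute(_, mapping)),
        Project(projected, rel))
    case _ => agg // other child shapes are out of scope for this sketch
  }
}
```

When the grouping and output lists contain the same `Rand(10L)`, both end up referring to a single projected column. When each `rand()` carries a different seed, as in `a + rand()` with generated default seeds, the sketch produces two distinct columns, so the select expression no longer equals the grouping expression. That mirrors the remaining issue described above.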

[SPARK-18969][SQL] Support grouping by nondeterministic expressions

ac81d95

cloud-fan mentioned this pull request Dec 26, 2016

[SPARK-18969][SQL] Support grouping by nondeterministic expressions #16404

Closed

asfgit closed this in 871d266 Jan 12, 2017
