Skip to content

Conversation

@rxin
Copy link
Contributor

@rxin rxin commented Dec 22, 2016

What changes were proposed in this pull request?

WIP

How was this patch tested?

Added a new test suite PullOutNondeterministicSuite to cover the new case and existing cases.

test("aggregate") {
checkAnalysis(
r.groupBy(rnd)(rnd),
r.select(a, b, rnd).groupBy(rndref)(rndref)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan any idea why this test case fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failure

[info] - aggregate *** FAILED *** (22 milliseconds)
[info]   == FAIL: Plans do not match ===
[info]   !Aggregate [_nondeterministic#0], [_nondeterministic#0 AS _nondeterministic#0]   Aggregate [_nondeterministic#0], [_nondeterministic#0]
[info]    +- Project [a#0, b#0, rand(10) AS _nondeterministic#0]                          +- Project [a#0, b#0, rand(10) AS _nondeterministic#0]
[info]       +- LocalRelation <empty>, [a#0, b#0]                                            +- LocalRelation <empty>, [a#0, b#0] (PlanTest.scala:95)

val leafNondeterministic = expr.collect {
case n: Nondeterministic => n
}
case p: UnaryNode if p.expressions.exists(!_.deterministic) =>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that we might want to whitelist operators rather than looking at all the unary nodes ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the ones that are interesting are:

  • Pivot
  • Window
  • Aggregate
  • RedistributeData (RepartitionByExpression and SortPartitions)
  • Sort

@SparkQA
Copy link

SparkQA commented Dec 22, 2016

Test build #70499 has finished for PR 16379 at commit ac81d95.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 871d266 Jan 12, 2017
asfgit pushed a commit that referenced this pull request Jan 12, 2017
## What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it.

close #16379

There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

## How was this patch tested?

a new test suite

Author: Wenchen Fan <[email protected]>

Closes #16404 from cloud-fan/groupby.

(cherry picked from commit 871d266)
Signed-off-by: Wenchen Fan <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 12, 2017
## What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it.

close #16379

There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

## How was this patch tested?

a new test suite

Author: Wenchen Fan <[email protected]>

Closes #16404 from cloud-fan/groupby.

(cherry picked from commit 871d266)
Signed-off-by: Wenchen Fan <[email protected]>
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it.

close apache#16379

There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

## How was this patch tested?

a new test suite

Author: Wenchen Fan <[email protected]>

Closes apache#16404 from cloud-fan/groupby.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it.

close apache#16379

There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

## How was this patch tested?

a new test suite

Author: Wenchen Fan <[email protected]>

Closes apache#16404 from cloud-fan/groupby.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants