[SPARK-18969][SQL] Support grouping by nondeterministic expressions - WIP #16379
test("aggregate") {
  checkAnalysis(
    r.groupBy(rnd)(rnd),
    r.select(a, b, rnd).groupBy(rndref)(rndref)
@cloud-fan any idea why this test case fails?
failure
[info] - aggregate *** FAILED *** (22 milliseconds)
[info] == FAIL: Plans do not match ===
[info] !Aggregate [_nondeterministic#0], [_nondeterministic#0 AS _nondeterministic#0] Aggregate [_nondeterministic#0], [_nondeterministic#0]
[info] +- Project [a#0, b#0, rand(10) AS _nondeterministic#0] +- Project [a#0, b#0, rand(10) AS _nondeterministic#0]
[info] +- LocalRelation <empty>, [a#0, b#0] +- LocalRelation <empty>, [a#0, b#0] (PlanTest.scala:95)
val leafNondeterministic = expr.collect {
  case n: Nondeterministic => n
}
case p: UnaryNode if p.expressions.exists(!_.deterministic) =>
note that we might want to whitelist operators rather than looking at all the unary nodes ...
It looks like the ones that are interesting are:
- Pivot
- Window
- Aggregate
- RedistributeData (RepartitionByExpression and SortPartitions)
- Sort
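To make the whitelist suggestion concrete, here is a minimal sketch in plain Scala. It is a toy model only: the case classes below are hypothetical stand-ins that borrow the operator names from the list above, not Spark's Catalyst classes, and `hasNondeterministicExpr` is an invented flag. `SortPartitions` is left out for brevity.

```scala
object WhitelistSketch {
  // Hypothetical stand-ins for logical operators; not Catalyst's real classes.
  sealed trait Plan { def hasNondeterministicExpr: Boolean }
  case class Project(hasNondeterministicExpr: Boolean) extends Plan
  case class Filter(hasNondeterministicExpr: Boolean) extends Plan
  case class Pivot(hasNondeterministicExpr: Boolean) extends Plan
  case class Window(hasNondeterministicExpr: Boolean) extends Plan
  case class Aggregate(hasNondeterministicExpr: Boolean) extends Plan
  case class RepartitionByExpression(hasNondeterministicExpr: Boolean) extends Plan
  case class Sort(hasNondeterministicExpr: Boolean) extends Plan

  // Broad match: every operator that holds a nondeterministic expression qualifies.
  def broadMatch(p: Plan): Boolean = p.hasNondeterministicExpr

  // Whitelisted match: only the explicitly listed operators qualify.
  def whitelisted(p: Plan): Boolean = p match {
    case _: Pivot | _: Window | _: Aggregate | _: RepartitionByExpression | _: Sort =>
      p.hasNondeterministicExpr
    case _ => false
  }
}
```

In this sketch a `Filter` with a nondeterministic expression is picked up by `broadMatch` but ignored by `whitelisted`, so any operator added later has to be opted in explicitly.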
Test build #70499 has finished for PR 16379 at commit
## What changes were proposed in this pull request?

Currently nondeterministic expressions are allowed in `Aggregate` (see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it.

close #16379

There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different (we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue.

## How was this patch tested?

a new test suite

Author: Wenchen Fan <[email protected]>

Closes #16404 from cloud-fan/groupby.

(cherry picked from commit 871d266)
Signed-off-by: Wenchen Fan <[email protected]>
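The idea of pulling a nondeterministic grouping expression out of an aggregate and into a projection beneath it can be sketched with a toy model in plain Scala. Everything here is an assumption made for illustration: `Expr`, `Col`, `Rand`, `Relation`, `Project`, `Aggregate` and `pullOut` are hypothetical names defined in the block, not Catalyst's API, and the real rule handles many more cases (aggregate expressions, attribute IDs, other operators).

```scala
object PullOutSketch {
  // Hypothetical toy expression tree; not Catalyst's Expression hierarchy.
  sealed trait Expr { def deterministic: Boolean }
  case class Col(name: String) extends Expr { val deterministic = true }
  case class Rand(seed: Long) extends Expr { val deterministic = false }
  case class Add(left: Expr, right: Expr) extends Expr {
    def deterministic: Boolean = left.deterministic && right.deterministic
  }

  // Hypothetical toy logical plan.
  sealed trait Plan
  case class Relation(cols: Seq[String]) extends Plan
  case class Project(exprs: Seq[(String, Expr)], child: Plan) extends Plan
  case class Aggregate(grouping: Seq[Expr], child: Plan) extends Plan

  def pullOut(plan: Plan): Plan = plan match {
    case Aggregate(grouping, child: Relation) if grouping.exists(!_.deterministic) =>
      // Name each nondeterministic grouping expression so it is evaluated once, below.
      val pulled: Seq[(String, Expr)] = grouping.zipWithIndex.collect {
        case (e, i) if !e.deterministic => s"_nondeterministic$i" -> e
      }
      // Keep the child's columns and append the pulled-out expressions.
      val passThrough: Seq[(String, Expr)] = child.cols.map(c => c -> (Col(c): Expr))
      // Group by a reference to the projected column instead of the expression itself.
      val newGrouping = grouping.map { e =>
        pulled.find(_._2 == e).map(p => Col(p._1): Expr).getOrElse(e)
      }
      Aggregate(newGrouping, Project(passThrough ++ pulled, child))
    case other => other
  }
}
```

Applied to `Aggregate(Seq(Rand(10)), Relation(Seq("a", "b")))`, the sketch yields an aggregate that groups by `Col("_nondeterministic0")` over a projection of `a`, `b` and `Rand(10)`, which has the same shape as the expected plan in the test above. In this toy model two `Rand` values with different seeds are not equal, which loosely mirrors the remaining issue described for `rand()` with randomly generated default seeds.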
What changes were proposed in this pull request?
WIP
How was this patch tested?
Added a new test suite PullOutNondeterministicSuite to cover the new case and existing cases.