[SPARK-11275][SQL] Reimplement Expand as a Generator and fix existing implementation bugs #9429
- added unit tests for cube and rollup that actually check the result
- fixed bugs present in the previous implementation of cube/rollup/grouping sets (SPARK-11275)
|
That's cool, I like the idea of re-implementing it by introducing a UDTF, which looks much simpler. But it breaks the unit tests on my local machine when I run them like below: There are some unresolved attribute exceptions. I haven't dug into it yet, but can you fix that? |
|
Can anyone trigger the unit tests? @yhuai @liancheng |
|
ok to test |
expr => expr.transformDown {
..
}
Otherwise it can't substitute a+b inside an expression like sum(a+b) + count(c).
@chenghao-intel actually, that change would bring back the bug in question, since it would do the substitutions in situations like the one below, and the aggregations would be computed from the manipulated (nulls inserted) values.
select a + b, c, sum(a+b) + count(c)
from t1
group by a + b, c with rollup
In general, we don't want to transform anything below an AggregateExpression, but we do want to transform what is above it. So what I really need is a transformDownUntil method. BTW, making this change does fix the groupby_grouping_sets1 test, so I really do need to do something.
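The idea of a transformDownUntil can be illustrated with a minimal, self-contained sketch. It does not use Catalyst: `Expr`, `Attr`, `Add`, `Sum`, `Count`, `mapChildren`, and the signature of `transformDownUntil` are all hypothetical stand-ins chosen for illustration, and the replacement attribute names are made up. The sketch only shows the intended behaviour: substitute grouping expressions top-down, but stop descending once an aggregate is reached.

```scala
object TransformDownUntilSketch {
  // Toy expression tree. These case classes are hypothetical, not Catalyst classes.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class Sum(child: Expr) extends Expr   // stands in for an AggregateExpression
  case class Count(child: Expr) extends Expr // stands in for an AggregateExpression

  // Rebuild a node with f applied to each direct child.
  def mapChildren(e: Expr)(f: Expr => Expr): Expr = e match {
    case Add(l, r) => Add(f(l), f(r))
    case Sum(c)    => Sum(f(c))
    case Count(c)  => Count(f(c))
    case leaf      => leaf
  }

  // Apply `rule` top-down, but leave any subtree rooted at a `stop` node untouched.
  def transformDownUntil(e: Expr)(stop: Expr => Boolean)(rule: PartialFunction[Expr, Expr]): Expr =
    if (stop(e)) e
    else {
      val next = rule.applyOrElse(e, (x: Expr) => x)
      if (stop(next)) next
      else mapChildren(next)(c => transformDownUntil(c)(stop)(rule))
    }

  def isAggregate(e: Expr): Boolean = e match {
    case _: Sum | _: Count => true
    case _                 => false
  }

  val a: Expr = Attr("a")
  val b: Expr = Attr("b")
  val c: Expr = Attr("c")

  // select a + b, c, sum(a+b) + count(c)
  val selectList: Seq[Expr] = Seq(Add(a, b), c, Add(Sum(Add(a, b)), Count(c)))

  // Replace grouping expressions with (hypothetical) attributes produced by the expansion.
  val substituteGroupingExprs: PartialFunction[Expr, Expr] = {
    case Add(Attr("a"), Attr("b")) => Attr("a_plus_b_expanded")
    case Attr("c")                 => Attr("c_expanded")
  }

  // Guarded: aggregates keep reading the original a, b and c.
  val guarded: Seq[Expr] =
    selectList.map(e => transformDownUntil(e)(isAggregate)(substituteGroupingExprs))

  // Unguarded (plain top-down transform): aggregate inputs are rewritten too.
  val unguarded: Seq[Expr] =
    selectList.map(e => transformDownUntil(e)(_ => false)(substituteGroupingExprs))
}
```

In this toy model, `guarded` rewrites the first two select-list entries and leaves the arguments of `Sum` and `Count` alone, while `unguarded` also rewrites inside the aggregates, which corresponds to the situation described above.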
|
Test build #44901 has finished for PR 9429 at commit
|
|
After checking with Hive like: Hive actually doesn't support overlap with the columns used in the aggregation functions. We can probably make a simple fix based on the current master branch if we need to support that. And after double checking, the master branch will be optimized for expression constant folding while with the |
|
I'm going to close this PR in favor of just fixing the current implementation for now since it has recently become more optimized with support for unsafe rows. Thanks everyone for your comments. Some time later I may revive this Expand as Generator patch but as a separate ticket. |
This is an alternative to #9419
I got tired of fighting/fixing the bugs in the existing implementation of cube/rollup/grouping sets, specifically around the Expand operator, so I reimplemented it as a Generator. I think this makes for a cleaner implementation. I also added unit tests that show this implementation solves SPARK-11275.
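For readers unfamiliar with what an expand step does for rollup, here is a rough sketch in plain Scala collections. It is not the code in this patch and uses no Spark APIs: `rollupSets`, `expand`, and `rollupSumAPlusB` are hypothetical names, and the integer tag is a simple index assumed here for illustration rather than Spark's actual grouping ID encoding. Each input row is emitted once per grouping set, with the grouping columns outside the set replaced by `None` (SQL NULL), while the aggregate keeps reading the original values.

```scala
object ExpandSketch {
  // Rollup over n grouping columns yields n + 1 grouping sets:
  // (c1..cn), (c1..cn-1), ..., (). Columns are identified by position.
  def rollupSets(n: Int): Seq[Set[Int]] =
    (n to 0 by -1).map(k => (0 until k).toSet)

  // For one row of grouping values, emit one output row per grouping set.
  // Columns outside the set become None; the set's index is appended as a tag
  // (an assumption for this sketch, not Spark's grouping ID bitmask).
  def expand[A](groupCols: Seq[A], sets: Seq[Set[Int]]): Seq[(Seq[Option[A]], Int)] =
    sets.zipWithIndex.map { case (set, id) =>
      (groupCols.zipWithIndex.map { case (v, i) => if (set(i)) Some(v) else None }, id)
    }

  // select a + b, c, sum(a + b) ... group by a + b, c with rollup
  // The summed value is taken from the original a and b, not from the
  // expanded (possibly nulled) grouping columns.
  def rollupSumAPlusB(rows: Seq[(Int, Int, Int)]): Map[(Seq[Option[Int]], Int), Int] =
    rows.flatMap { case (a, b, c) =>
      expand(Seq(a + b, c), rollupSets(2)).map(key => key -> (a + b))
    }.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2).sum }
}
```

Calling `ExpandSketch.rollupSumAPlusB(Seq((1, 2, 10), (2, 1, 10), (1, 1, 20)))` gives one entry per (grouping key, grouping set) pair, including a grand-total entry whose grouping columns are all `None`.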
I look forward to your comments!
cc: @rxin @marmbrus @gatorsmile @rick-ibm @hvanhovell @chenghao-intel @holdenk