[SPARK-11275][SQL] Reimplement Expand as a Generator and fix existing implementation bugs #9429

aray · 2015-11-03T06:21:45Z

This is an alternative to #9419

I got tired of fighting/fixing the bugs with the existing implementation of cube/rollup/grouping sets specifically around the Expand operator so I reimplemented it as a Generator. I think this makes for a cleaner implementation. I also added unit tests that show this implementation solves SPARK-11275.

I look forward to your comments!

cc: @rxin @marmbrus @gatorsmile @rick-ibm @hvanhovell @chenghao-intel @holdenk

- added unit tests for cube and rollup that actualy check the result - fixed bugs present in previous implementation of cube/rollup/groupingsets (SPARK-11275)

chenghao-intel · 2015-11-03T07:20:12Z

That's cool, I like the idea to re-implement it by introducing a UDTF function, which looks much simpler.

But it breaks the unit test in my local, as run the unit test like below:

build/sbt -Phive -Dspark.hive.whitelist='groupby.*_grouping.*' 'test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite'

Some unresolved attribute exceptions, I didn't dig it yet, but can you solve that?

chenghao-intel · 2015-11-03T07:21:54Z

And anyone trigger the unit test? @yhuai @liancheng

yhuai · 2015-11-03T07:26:03Z

ok to test

chenghao-intel · 2015-11-03T07:31:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

expr => expr.transformDown { .. }

Otherwise it's not able to substitute the expression like sum(a+b) + count(c) for a+b.

@chenghao-intel actually that change would bring back the bug in question since it would do the substitutions in situations like below and the aggregations would be computed off the manipulated (nulls inserted) values.

select a + b, c, sum(a+b) + count(c) from t1 group by a + b, c with rollup

In general anything below an AggregateExpression we don't want to transform, but above we do. So really I need a transformDownUntil method. BTW making this change does fix the groupby_grouping_sets1 test so I really do need to do something.

SparkQA · 2015-11-03T09:44:49Z

Test build #44901 has finished for PR 9429 at commit e463679.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

chenghao-intel · 2015-11-04T01:16:49Z

After checking with Hive like:

hive> select sum(a-b) as ab from mytable group by b with rollup;
FAILED: SemanticException [Error 10210]: Grouping sets aggregations (with rollups or cubes) are not allowed if aggregation function parameters overlap with the aggregation functions columns

Hive actually doesn't support the overlap with the aggregation functions columns. Probably we can have a simple fixing based on the current master branch if we need to support that.

And after double checking, the master branch will be optimized for expression constant folding while with the Expand operator, it means better performance than re-implemented based on UDTF, so I am a little struggling which is the better approach for the implementation.

aray · 2015-11-17T17:57:11Z

I'm going to close this PR in favor of just fixing the current implementation for now since it has recently become more optimized with support for unsafe rows. Thanks everyone for your comments. Some time later I may revive this Expand as Generator patch but as a separate ticket.

aray added 2 commits November 2, 2015 23:56

Reimplement Expand as a Generator

fad28d6

- added unit tests for cube and rollup that actualy check the result - fixed bugs present in previous implementation of cube/rollup/groupingsets (SPARK-11275)

newline at end of generators.scala

e463679

chenghao-intel reviewed Nov 3, 2015
View reviewed changes

cloud-fan mentioned this pull request Nov 4, 2015

[SPARK-11275][SQL] Rollup and Cube Generates the Incorrect Results when Aggregation Functions Use Group By Columns #9419

Closed

aray closed this Nov 17, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-11275][SQL] Reimplement Expand as a Generator and fix existing implementation bugs #9429

[SPARK-11275][SQL] Reimplement Expand as a Generator and fix existing implementation bugs #9429

Uh oh!

aray commented Nov 3, 2015

Uh oh!

chenghao-intel commented Nov 3, 2015

Uh oh!

chenghao-intel commented Nov 3, 2015

Uh oh!

yhuai commented Nov 3, 2015

Uh oh!

chenghao-intel Nov 3, 2015

Uh oh!

aray Nov 3, 2015

Uh oh!

SparkQA commented Nov 3, 2015

Uh oh!

chenghao-intel commented Nov 4, 2015

Uh oh!

aray commented Nov 17, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[SPARK-11275][SQL] Reimplement Expand as a Generator and fix existing implementation bugs #9429

[SPARK-11275][SQL] Reimplement Expand as a Generator and fix existing implementation bugs #9429

Uh oh!

Conversation

aray commented Nov 3, 2015

Uh oh!

chenghao-intel commented Nov 3, 2015

Uh oh!

chenghao-intel commented Nov 3, 2015

Uh oh!

yhuai commented Nov 3, 2015

Uh oh!

chenghao-intel Nov 3, 2015

Choose a reason for hiding this comment

Uh oh!

aray Nov 3, 2015

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Nov 3, 2015

Uh oh!

chenghao-intel commented Nov 4, 2015

Uh oh!

aray commented Nov 17, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants