Skip to content

Conversation

@aray
Copy link
Contributor

@aray aray commented Nov 3, 2015

This is an alternative to #9419

I got tired of fighting/fixing the bugs with the existing implementation of cube/rollup/grouping sets specifically around the Expand operator so I reimplemented it as a Generator. I think this makes for a cleaner implementation. I also added unit tests that show this implementation solves SPARK-11275.

I look forward to your comments!

cc: @rxin @marmbrus @gatorsmile @rick-ibm @hvanhovell @chenghao-intel @holdenk

aray added 2 commits November 2, 2015 23:56
- added unit tests for cube and rollup that actualy check the result
- fixed bugs present in previous implementation of cube/rollup/groupingsets (SPARK-11275)
@chenghao-intel
Copy link
Contributor

That's cool, I like the idea to re-implement it by introducing a UDTF function, which looks much simpler.

But it breaks the unit test in my local, as run the unit test like below:

build/sbt -Phive -Dspark.hive.whitelist='groupby.*_grouping.*' 'test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite'

Some unresolved attribute exceptions, I didn't dig it yet, but can you solve that?

@chenghao-intel
Copy link
Contributor

And anyone trigger the unit test? @yhuai @liancheng

@yhuai
Copy link
Contributor

yhuai commented Nov 3, 2015

ok to test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expr => expr.transformDown { 
..
}

Otherwise it's not able to substitute the expression like sum(a+b) + count(c) for a+b.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenghao-intel actually that change would bring back the bug in question since it would do the substitutions in situations like below and the aggregations would be computed off the manipulated (nulls inserted) values.

select a + b, c, sum(a+b) + count(c)
from t1
group by a + b, c with rollup

In general anything below an AggregateExpression we don't want to transform, but above we do. So really I need a transformDownUntil method. BTW making this change does fix the groupby_grouping_sets1 test so I really do need to do something.

@SparkQA
Copy link

SparkQA commented Nov 3, 2015

Test build #44901 has finished for PR 9429 at commit e463679.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Copy link
Contributor

After checking with Hive like:

hive> select sum(a-b) as ab from mytable group by b with rollup;
FAILED: SemanticException [Error 10210]: Grouping sets aggregations (with rollups or cubes) are not allowed if aggregation function parameters overlap with the aggregation functions columns

Hive actually doesn't support the overlap with the aggregation functions columns. Probably we can have a simple fixing based on the current master branch if we need to support that.

And after double checking, the master branch will be optimized for expression constant folding while with the Expand operator, it means better performance than re-implemented based on UDTF, so I am a little struggling which is the better approach for the implementation.

@aray
Copy link
Contributor Author

aray commented Nov 17, 2015

I'm going to close this PR in favor of just fixing the current implementation for now since it has recently become more optimized with support for unsafe rows. Thanks everyone for your comments. Some time later I may revive this Expand as Generator patch but as a separate ticket.

@aray aray closed this Nov 17, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants