[SPARK-10735] [SQL] Generate aggregation w/o grouping keys [WIP] #10786

davies · 2016-01-16T09:21:52Z

The benchmark shows that generated aggregation can be 6x faster than non-generated aggregation (with generated filter/range).

It also shows that a declarative function is much faster than an imperative one (5x faster for stddev), so we may need to switch to a declarative version. (The difference still needs to be measured when there are grouping keys.)
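As a rough illustration of why a declarative stddev is cheap to generate code for, here is a minimal single-pass sketch: the buffer is three primitive slots and every update is a pure formula over the old slots and the input. The class name `StddevSketch` and the Welford-style update are assumptions made for this example; this is not the `StddevAgg` code from the patch.

```java
// Hypothetical illustration; not code from this patch.
final class StddevSketch {
    // Aggregation buffer: three primitive slots.
    private double count = 0.0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the running mean

    // Update step (Welford's method): each new slot value is a pure
    // function of the old slot values and the input value.
    void update(double x) {
        count += 1.0;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    // Population standard deviation; NaN for an empty input in this sketch.
    double stddevPop() {
        return count == 0.0 ? Double.NaN : Math.sqrt(m2 / count);
    }

    // Sample standard deviation; NaN for fewer than two values in this sketch.
    double stddevSamp() {
        return count <= 1.0 ? Double.NaN : Math.sqrt(m2 / (count - 1.0));
    }

    // Convenience helper that folds all values into a fresh buffer.
    static StddevSketch of(double... values) {
        StddevSketch agg = new StddevSketch();
        for (double v : values) {
            agg.update(v);
        }
        return agg;
    }
}
```

Because the update touches only local primitives, a code generator can paste it straight into the scan loop instead of calling through an aggregate-function object for every row.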

Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala

Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodegenFallback.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratePredicate.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/GenerateColumnAccessor.scala

SparkQA · 2016-01-16T09:35:34Z

Test build #49526 has finished for PR 10786 at commit 640a4b5.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class StddevAgg(child: Expression) extends DeclarativeAggregate
- case class StddevPop1(child: Expression) extends StddevAgg(child)
- case class StddevSamp1(child: Expression) extends StddevAgg(child)

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegen.scala sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala

SparkQA · 2016-01-16T19:57:06Z

Test build #49537 has finished for PR 10786 at commit 6911a05.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-17T05:57:44Z

Test build #49540 has finished for PR 10786 at commit 10f6bb9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-17T07:56:39Z

Test build #49546 has finished for PR 10786 at commit 4c655dd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-17T08:28:54Z

Test build #49548 has finished for PR 10786 at commit c8ed5d9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-01-18T19:57:56Z

sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala

Based on this benchmark, declarative functions are faster than imperative functions, both with and without whole-stage codegen. Should we switch to implementing all built-in aggregate functions as declarative ones? cc @mengxr @rxin @marmbrus

mengxr · 2016-01-19

Glad to see 5x speed-up! +1 on switching to declarative on stddev (and other 2nd-order statistics). But for skewness and kurtosis, we still need to benchmark the performance to decide because their expressions are more complex.

marmbrus · 2016-01-19

I've always preferred declarative aggregates because they are much easier to optimize (very cool, the kind of speedups you are getting!). As such, I'd support having all of our built-in functions done this way.

@mengxr argues that it's too confusing for users and that we should also support the imperative one. How high is the cost of that for us?
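To make the trade-off in this thread concrete, here is a toy model of the two styles. It is a sketch under stated assumptions: `AggStyles`, `ImperativeAgg`, `Expr`, and `generateUpdate` are hypothetical names invented for this example and are not Spark's API. The imperative aggregate exposes only an opaque `update` method, while the declarative one describes its update as an expression tree that the engine can turn into source text.

```java
// Hypothetical names throughout; a toy model, not Spark's API.
final class AggStyles {
    // Imperative style: the engine can only call update() once per row.
    interface ImperativeAgg {
        long initial();
        long update(long buffer, long input);
    }

    // A count written imperatively.
    static final class CountImperative implements ImperativeAgg {
        public long initial() { return 0L; }
        public long update(long buffer, long input) { return buffer + 1L; }
    }

    // Declarative style: the update is data (a tiny expression tree)
    // that the engine can inspect and compile.
    interface Expr {
        String gen(); // emit Java source for this expression
    }

    static Expr buffer() { return () -> "buf"; }
    static Expr input() { return () -> "in"; }
    static Expr lit(long v) { return () -> v + "L"; }
    static Expr add(Expr l, Expr r) {
        return () -> "(" + l.gen() + " + " + r.gen() + ")";
    }

    // Turn an update expression into a statement that can be inlined
    // into a generated loop body.
    static String generateUpdate(Expr updateExpr) {
        return "buf = " + updateExpr.gen() + ";";
    }
}
```

For a count, `AggStyles.generateUpdate(AggStyles.add(AggStyles.buffer(), AggStyles.lit(1L)))` produces the text `buf = (buf + 1L);`. An imperative aggregate gives the engine nothing comparable to inline, which is one plausible reason it is harder to support in generated code.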

As discussed in #10786, the generated TungstenAggregate does not support imperative functions.

For a query

```
sqlContext.range(10).filter("id > 1").groupBy().count()
```

the generated code will look like:

```
/* 032 */     if (!initAgg0) {
/* 033 */       initAgg0 = true;
/* 034 */
/* 035 */       // initialize aggregation buffer
/* 037 */       long bufValue2 = 0L;
/* 038 */
/* 039 */
/* 040 */       // initialize Range
/* 041 */       if (!range_initRange5) {
/* 042 */         range_initRange5 = true;
       ...
/* 071 */       }
/* 072 */
/* 073 */       while (!range_overflow8 && range_number7 < range_partitionEnd6) {
/* 074 */         long range_value9 = range_number7;
/* 075 */         range_number7 += 1L;
/* 076 */         if (range_number7 < range_value9 ^ 1L < 0) {
/* 077 */           range_overflow8 = true;
/* 078 */         }
/* 079 */
/* 085 */         boolean primitive11 = false;
/* 086 */         primitive11 = range_value9 > 1L;
/* 087 */         if (!false && primitive11) {
/* 092 */           // do aggregate and update aggregation buffer
/* 099 */           long primitive17 = -1L;
/* 100 */           primitive17 = bufValue2 + 1L;
/* 101 */           bufValue2 = primitive17;
/* 105 */         }
/* 107 */       }
/* 109 */
/* 110 */       // output the result
/* 112 */       bufferHolder25.reset();
/* 114 */       rowWriter26.initialize(bufferHolder25, 1);
/* 118 */       rowWriter26.write(0, bufValue2);
/* 120 */       result24.pointTo(bufferHolder25.buffer, bufferHolder25.totalSize());
/* 121 */       currentRow = result24;
/* 122 */       return;
/* 124 */     }
/* 125 */
```

cc nongli

Author: Davies Liu <[email protected]>

Closes #10840 from davies/gen_agg.
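The generated code above is easier to follow as a hand-written equivalent. The sketch below is a simplified stand-in, not generator output: the class and method names are made up for illustration, it handles a single partition with step 1, and it drops the row-writer and overflow bookkeeping. It keeps the key property of the generated version, which is that range, filter, and aggregate are fused into one loop over primitives.

```java
// Hand-written stand-in for the generated code; names are illustrative.
final class FusedRangeFilterCount {
    // Equivalent of range(start, end).filter("id > 1").groupBy().count()
    // for a single partition with step 1.
    static long run(long start, long end) {
        long bufValue = 0L;      // aggregation buffer (the running count)
        long number = start;     // next value the range will produce
        while (number < end) {
            long value = number; // current row: a single long column
            number += 1L;
            if (value > 1L) {    // filter: id > 1
                bufValue += 1L;  // aggregate: count = count + 1
            }
        }
        return bufValue;         // output the result
    }
}
```

With this sketch, `FusedRangeFilterCount.run(0L, 10L)` returns 8, matching the query in the example: ids 2 through 9 pass the filter.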

davies · 2016-01-25T23:30:51Z

This PR will be replaced by multiple small PRs.

Davies Liu added 27 commits January 11, 2016 17:34

whole stage codegen

2da493f

support range

998b6a1

add benchmark

218412f

remove println and test

158cb36

fix bug

88c51a6

fix tests

b76c00e

clean the interface

4e2ee88

update test

f05524c

fix style, improve comments

43139a8

fix style

73fe074

fix test

32309e1

WholeStageCodegen/FakeInput as adapter

38029bc

address comments

1f9ddb8

fix style

c1dd60a

Merge branch 'master' of github.com:apache/spark into whole2

16ce50c

Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

fix bug

d453081

fix test

34a0a6f

address comments

908c8cb

use sparkPlan for checking

1feab20

create a rule for whole stage codegen

c9741ea

renaming

7c05703

fix style

0b40106

generate aggregation

3df7e5d

support impremitive function

b43f262

support imperative functions

640a4b5

Merge branch 'master' of github.com:apache/spark into gen_agg2

6911a05

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegen.scala sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala

fix bug

10f6bb9

Davies Liu added 2 commits January 16, 2016 23:28

update benchmark

4c655dd

enable sub-expression elimination

c8ed5d9

davies force-pushed the gen_agg2 branch from 43b9c4e to c8ed5d9 on January 17, 2016 07:56

davies reviewed Jan 18, 2016

davies mentioned this pull request Jan 19, 2016

[SPARK-12797] [SQL] Generated TungstenAggregate (without grouping keys) #10840

Closed

davies closed this Jan 25, 2016

4 participants

[SPARK-10735] [SQL] Generate aggregation w/o grouping keys [WIP] #10786

[SPARK-10735] [SQL] Generate aggregation w/o grouping keys [WIP] #10786

Uh oh!

Conversation

davies commented Jan 16, 2016

Uh oh!

SparkQA commented Jan 16, 2016

Uh oh!

SparkQA commented Jan 16, 2016

Uh oh!

SparkQA commented Jan 17, 2016

Uh oh!

SparkQA commented Jan 17, 2016

Uh oh!

SparkQA commented Jan 17, 2016

Uh oh!

davies Jan 18, 2016

Choose a reason for hiding this comment

Uh oh!

mengxr Jan 19, 2016

Choose a reason for hiding this comment

Uh oh!

marmbrus Jan 19, 2016

Choose a reason for hiding this comment

Uh oh!

davies commented Jan 25, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants