Conversation

@marmbrus
Contributor

This PR adds the ability to perform aggregations inside of a ContinuousQuery. To implement this feature, the planning of aggregation has been augmented with a new StatefulAggregationStrategy. Unlike batch aggregation, stateful aggregation uses the StateStore (introduced in #11645) to persist the results of partial aggregation across different invocations. The resulting physical plan performs the aggregation using the following progression:

  • Partial Aggregation
  • Shuffle
  • Partial Merge (now there is at most 1 tuple per group)
  • StateStoreRestore (now there is 1 tuple from this batch + optionally one from the previous)
  • Partial Merge (now there is at most 1 tuple per group)
  • StateStoreSave (saves the tuple for the next batch)
  • Complete (output the current result of the aggregation)
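As an illustration only, the progression above can be simulated for a streaming count-per-key query. This is a toy Python sketch; the real operators are Scala physical plan nodes backed by the StateStore, and every name here (`state`, `run_batch`, etc.) is made up for the example:

```python
# Toy simulation of the stateful aggregation progression for a
# count-per-key streaming query. All names are illustrative; the real
# implementation is Scala operators over Spark's StateStore.
from collections import Counter

state = {}  # plays the role of the StateStore: group key -> running count

def run_batch(partitions):
    # Partial Aggregation: per-partition partial counts
    partials = [Counter(p) for p in partitions]
    # Shuffle + Partial Merge: at most one tuple per group
    merged = Counter()
    for c in partials:
        merged.update(c)
    # StateStoreRestore + Partial Merge: combine this batch's tuple with
    # the one saved by the previous batch (if any)
    for key, count in state.items():
        merged[key] += count
    # StateStoreSave: persist the merged tuples for the next batch
    state.clear()
    state.update(merged)
    # Complete: output the current result of the aggregation
    return dict(merged)

batch1 = run_batch([["a", "b"], ["a"]])  # -> {'a': 2, 'b': 1}
batch2 = run_batch([["b", "b"]])         # -> {'a': 2, 'b': 3}
```

Note how the second batch only sees two `"b"` records, yet the output still reflects the totals across both batches because the restore/save steps carry the state forward.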

The following refactoring was also performed to allow us to plug into existing code:

  • The get/put implementation is taken from [SPARK-14214][SQL] Update state to provide get/put interface #12013
  • The logic for breaking down and de-duping the physical execution of aggregation has been moved into a new pattern, PhysicalAggregation.
  • The AttributeReference used to identify the result of an AggregateFunction has been moved into the AggregateExpression container. This change moves the reference into the same object as the other intermediate references used in aggregation and eliminates the need to pass around a Map[(AggregateFunction, Boolean), Attribute]. Further cleanup (using a different aggregation container for logical/physical plans) is deferred to a follow-up.
  • Some planning logic is moved from the SessionState into the QueryExecution to make it easier to override in the streaming case.
  • The ability to write a StreamTest that checks only the output of the last batch has been added to simulate the future addition of output modes.
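For context on the first bullet, a get/put-style state store can be pictured roughly like this. This is a minimal Python sketch; the actual StateStore in #12013 is a Scala interface with versioning and fault-tolerance semantics not modeled here, and the class and method names are illustrative:

```python
# Minimal sketch of a get/put-style state store. The real StateStore is
# versioned and fault tolerant; this toy version only shows the idea of
# staging updates for a batch and then committing them.
class InMemoryStateStore:
    def __init__(self):
        self._data = {}     # committed state
        self._pending = {}  # updates staged by the current batch

    def get(self, key):
        # Reads see staged updates first, then committed state.
        if key in self._pending:
            return self._pending[key]
        return self._data.get(key)

    def put(self, key, value):
        self._pending[key] = value

    def commit(self):
        # Publish the batch's staged updates into committed state.
        self._data.update(self._pending)
        self._pending = {}

store = InMemoryStateStore()
store.put("a", 1)
store.commit()
store.put("a", store.get("a") + 1)  # read-modify-write across batches
store.commit()
```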

@SparkQA

SparkQA commented Mar 29, 2016

Test build #54469 has finished for PR 12048 at commit 7a5e0ae.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 30, 2016

Test build #54470 has finished for PR 12048 at commit 29355db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 30, 2016

Test build #54485 has finished for PR 12048 at commit 6aeb27a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor Author

Test this please

@marmbrus
Contributor Author

test this please

@SparkQA

SparkQA commented Mar 30, 2016

Test build #54543 has finished for PR 12048 at commit 6aeb27a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 30, 2016

Test build #2710 has finished for PR 12048 at commit 6aeb27a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor Author

@tdas the maintenance test failed twice on this PR, but not the third time.

@yhuai
Contributor

yhuai commented Mar 30, 2016

Aggregation part looks good to me.

    } catch {
      case NonFatal(e) =>
        failTest(message, e)
    if (!condition) {
Contributor


I had written it this way so that if there are any errors in the lazy evaluation of condition, they get caught and the message is printed correctly. That happened to me a few times.
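The pattern being described, where the condition is evaluated inside the try so an exception thrown during its evaluation is reported with the intended message, looks roughly like this. This is a Python analog of the Scala helper; all names are illustrative:

```python
# Python analog of the verify helper under discussion: the condition is
# passed lazily (as a callable) and evaluated inside the try block, so
# an exception raised while computing it is reported with the intended
# failure message instead of escaping as an unrelated error.
class TestFailure(Exception):
    pass

def fail_test(message, cause=None):
    raise TestFailure(message) from cause

def verify(condition, message):
    try:
        if not condition():  # lazy evaluation happens here
            fail_test(message)
    except TestFailure:
        raise
    except Exception as e:   # error raised while evaluating the condition
        fail_test(message, e)
```

With this shape, a call like `verify(lambda: rows[0] == expected, "wrong first row")` reports "wrong first row" even when `rows` is empty, rather than surfacing a bare IndexError.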

Contributor Author


All uses of verify are just doing equality checks with variables (and thus can't throw), except for those that were modified specifically such that they were going to throw exceptions upon failure. So I think the real problem is overloading what was a simple assert to be an error handler.

The issue with this construction is that it now pollutes the output with the obvious:

condition was false
[info]   org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
[info]          org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]          org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
[info]          org.apache.spark.sql.StreamTest$class.verify$1(StreamTest.scala:228)
[info]          org.apache.spark.sql.StreamTest$$anonfun$testStream$1.apply(StreamTest.scala:355)
[info]          org.apache.spark.sql.StreamTest$$anonfun$testStream$1.apply(StreamTest.scala:271)
[info]          scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info]          scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[info]          org.apache.spark.sql.StreamTest$class.testStream(StreamTest.scala:271)
[info]          org.apache.spark.sql.streaming.StreamSuite.testStream(StreamSuite.scala:24)

@asfgit asfgit closed this in 0fc4aaa Apr 1, 2016
@marmbrus
Contributor Author

marmbrus commented Apr 1, 2016

Thanks, I'm going to merge to master and will address further comments in follow-ups.
