[SPARK-13809][SQL] State store for streaming aggregations #11645
Conversation
Test build #52897 has finished for PR 11645 at commit
Test build #52898 has finished for PR 11645 at commit
    }
  }

  private def remove(storeId: StateStoreId): Unit = {
Maybe we also need synchronized for remove()?
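A minimal sketch of what the reviewer is suggesting (the names below are illustrative stand-ins, not the PR's actual fields): guard `remove()` with the same lock used by the other accessors of the shared provider map.

```scala
import scala.collection.mutable

// Illustrative sketch only: loadedProviders and StateStoreId stand in for the PR's
// actual fields; the point is that remove() takes the same lock as the other
// methods that touch the shared map.
object StateStoreSketch {
  case class StateStoreId(operatorId: Long, partitionId: Int)
  private val loadedProviders = new mutable.HashMap[StateStoreId, AnyRef]()

  private def remove(storeId: StateStoreId): Unit = loadedProviders.synchronized {
    loadedProviders.remove(storeId)
  }
}
```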
Test build #52988 has finished for PR 11645 at commit
Test build #53064 has finished for PR 11645 at commit
Test build #53070 has finished for PR 11645 at commit
Test build #53085 has finished for PR 11645 at commit
Test build #53146 has finished for PR 11645 at commit
Test build #53147 has finished for PR 11645 at commit
  override protected def getPartitions: Array[Partition] = dataRDD.partitions

  override def getPreferredLocations(partition: Partition): Seq[String] = {
    Seq.empty
Shouldn't you be using the preferred location here?
Yes. It's still WIP. I need to enable the StateStoreCoordinator for this, which is what I am working on now.
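A hypothetical sketch of what that could look like once the coordinator is wired in (the trait, method, and type names are illustrative, not the PR's final API): the RDD asks the coordinator which executor last loaded the store for this partition and reports that as the preferred location instead of `Seq.empty`.

```scala
object PreferredLocationSketch {
  // Illustrative stand-ins for the store id and the coordinator reference.
  case class StateStoreId(operatorId: Long, partitionId: Int)

  trait StateStoreCoordinatorSketch {
    /** Returns the executor location that last loaded this store, if known. */
    def getLocation(storeId: StateStoreId): Option[String]
  }

  // What getPreferredLocations could compute instead of Seq.empty.
  def preferredLocations(
      coordinator: Option[StateStoreCoordinatorSketch],
      storeId: StateStoreId): Seq[String] = {
    coordinator.flatMap(_.getLocation(storeId)).toSeq
  }
}
```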
Test build #53178 has finished for PR 11645 at commit
 * to ensure re-executed RDD operations re-apply updates on the correct past version of the
 * store.
 */
class HDFSBackedStateStoreProvider(
note to self: private[state]
Test build #53805 has finished for PR 11645 at commit
Test build #53794 has finished for PR 11645 at commit
Test build #53793 has finished for PR 11645 at commit
      maintenanceTask.cancel(false)
      maintenanceTask = null
    }
    logInfo("StateStore stopped")
You need to call `maintenanceTaskExecutor.shutdown` before this line.

To allow this companion object to be reused, it should not be shut down.
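A minimal sketch of the trade-off being discussed (not the PR's actual code): `stop()` cancels the scheduled maintenance task but deliberately leaves the shared executor running so the companion object can be started again in the same JVM.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, ScheduledFuture}

object MaintenanceLifecycleSketch {
  private val maintenanceTaskExecutor: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()
  @volatile private var maintenanceTask: ScheduledFuture[_] = null

  def stop(): Unit = synchronized {
    if (maintenanceTask != null) {
      maintenanceTask.cancel(false) // stop future runs; let an in-flight run finish
      maintenanceTask = null
    }
    // Intentionally NOT calling maintenanceTaskExecutor.shutdown() here, so the
    // executor can be reused if the state store is started again later.
  }
}
```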
Test build #53804 has finished for PR 11645 at commit
/**
 * Create a reference to a [[StateStoreCoordinator]], This can be called from driver as well as
 * executors.
This cannot be called from executors, because creating a StateStoreCoordinator there will succeed even if there is already a StateStoreCoordinator in the driver.
Test build #53791 has finished for PR 11645 at commit
  }
}

private[state] object HDFSBackedStateStoreProvider {
nit: Remove this
Test build #53810 has finished for PR 11645 at commit
  test("distributed test") {
    quietly {
      withSpark(new SparkContext(sparkConf.setMaster("local-cluster[2, 1, 1024]"))) { sc =>
nit: should clone sparkConf
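What the nit suggests, as a sketch of the quoted test line (it reuses the test's own `sparkConf` and `withSpark` helpers): clone the shared conf before setting the master so other tests are not affected.

```scala
// Clone the shared conf instead of mutating it in place.
val conf = sparkConf.clone().setMaster("local-cluster[2, 1, 1024]")
withSpark(new SparkContext(conf)) { sc =>
  // ... test body ...
}
```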
        if (keySize == -1) {
          eof = true
        } else if (keySize < 0) {
          throw new Exception(
nit: IOException
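What the nit asks for, as a sketch of the quoted branch (the message text is illustrative, and `eof`/`keySize` come from the snippet above): throw the more specific `IOException` for a corrupt delta file rather than a bare `Exception`.

```scala
import java.io.IOException

if (keySize == -1) {
  eof = true
} else if (keySize < 0) {
  throw new IOException(s"Error reading delta file: negative key size of $keySize")
}
```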
Looks good overall. Just some bits.
Test build #53845 has finished for PR 11645 at commit
Test build #53851 has finished for PR 11645 at commit
Test build #53850 has finished for PR 11645 at commit
Test build #53857 has finished for PR 11645 at commit
This PR adds the ability to perform aggregations inside of a `ContinuousQuery`. In order to implement this feature, the planning of aggregation has been augmented with a new `StatefulAggregationStrategy`. Unlike batch aggregation, stateful aggregation uses the `StateStore` (introduced in #11645) to persist the results of partial aggregation across different invocations. The resulting physical plan performs the aggregation using the following progression:

- Partial Aggregation
- Shuffle
- Partial Merge (now there is at most 1 tuple per group)
- StateStoreRestore (now there is 1 tuple from this batch + optionally one from the previous)
- Partial Merge (now there is at most 1 tuple per group)
- StateStoreSave (saves the tuple for the next batch)
- Complete (output the current result of the aggregation)

The following refactoring was also performed to allow us to plug into existing code:

- The get/put implementation is taken from #12013
- The logic for breaking down and de-duping the physical execution of aggregation has been moved into a new pattern, `PhysicalAggregation`
- The `AttributeReference` used to identify the result of an `AggregateFunction` has been moved into the `AggregateExpression` container. This change moves the reference into the same object as the other intermediate references used in aggregation and eliminates the need to pass around a `Map[(AggregateFunction, Boolean), Attribute]`. Further clean-up (using a different aggregation container for logical/physical plans) is deferred to a followup.
- Some planning logic is moved from the `SessionState` into the `QueryExecution` to make it easier to override in the streaming case.
- The ability to write a `StreamTest` that checks only the output of the last batch has been added to simulate the future addition of output modes.

Author: Michael Armbrust <[email protected]>

Closes #12048 from marmbrus/statefulAgg.
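For illustration, a hypothetical sketch (not from the PR) of the kind of query that exercises this plan; `streamingDf` stands in for a streaming DataFrame and the source/sink setup is omitted.

```scala
import org.apache.spark.sql.DataFrame

object StatefulAggregationExample {
  // groupBy/count over a streaming DataFrame is what StatefulAggregationStrategy
  // plans using the progression listed above.
  def runningWordCounts(streamingDf: DataFrame): DataFrame =
    streamingDf.groupBy("word").count()
}
```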
## What changes were proposed in this pull request?
In this PR, I am implementing a new abstraction for the management of streaming state data: the State Store. It is a key-value store for persisting running aggregates for aggregate operations in streaming DataFrames. The motivation and design are discussed here:
https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#
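For orientation, a rough, illustrative sketch of what a versioned key-value state store API could look like (method names and types below are assumptions for the sake of the example, not the exact trait added by this PR):

```scala
object StateStoreApiSketch {
  case class StateStoreId(operatorId: Long, partitionId: Int)

  trait StateStore {
    def id: StateStoreId                      // which operator/partition this store belongs to
    def get(key: String): Option[Long]        // read the current running aggregate for a key
    def put(key: String, value: Long): Unit   // stage an updated aggregate for this batch
    def commit(): Long                        // atomically commit staged updates; returns the new version
    def iterator(): Iterator[(String, Long)]  // scan all committed key-value pairs
  }
}
```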
## How was this patch tested?
Coverage from unit tests

## TODO

- [x] Fix updates() iterator to avoid duplicate updates for same key
- [x] Use Coordinator in ContinuousQueryManager
- [x] Plugging in hadoop conf and other confs
- [x] Unit tests
  - [x] StateStore object lifecycle and methods
  - [x] StateStoreCoordinator communication and logic
  - [x] StateStoreRDD fault-tolerance
  - [x] StateStoreRDD preferred location using StateStoreCoordinator
- [ ] Cluster tests
  - [ ] Whether preferred locations are set correctly
  - [ ] Whether recovery works correctly with distributed storage
- [x] Basic performance tests
- [x] Docs