Commit 8c82688
committed
[SPARK-13809][SQL] State store for streaming aggregations
## What changes were proposed in this pull request?
In this PR, I am implementing a new abstraction for management of streaming state data - State Store. It is a key-value store for persisting running aggregates for aggregate operations in streaming dataframes. The motivation and design is discussed here.
https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#
## How was this patch tested?
- [x] Unit tests
- [x] Cluster tests
**Coverage from unit tests**
<img width="952" alt="screen shot 2016-03-21 at 3 09 40 pm" src="https://cloud.githubusercontent.com/assets/663212/13935872/fdc8ba86-ef76-11e5-93e8-9fa310472c7b.png">
## TODO
- [x] Fix updates() iterator to avoid duplicate updates for same key
- [x] Use Coordinator in ContinuousQueryManager
- [x] Plugging in hadoop conf and other confs
- [x] Unit tests
- [x] StateStore object lifecycle and methods
- [x] StateStoreCoordinator communication and logic
- [x] StateStoreRDD fault-tolerance
- [x] StateStoreRDD preferred location using StateStoreCoordinator
- [ ] Cluster tests
- [ ] Whether preferred locations are set correctly
- [ ] Whether recovery works correctly with distributed storage
- [x] Basic performance tests
- [x] Docs
Author: Tathagata Das <[email protected]>
Closes #11645 from tdas/state-store.1 parent 0a64294 commit 8c82688
File tree
11 files changed
+2052
-0
lines changed- sql/core/src
- main/scala/org/apache/spark/sql
- execution/streaming/state
- internal
- test/scala/org/apache/spark/sql/execution/streaming/state
11 files changed
+2052
-0
lines changedLines changed: 3 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
21 | 21 | | |
22 | 22 | | |
23 | 23 | | |
| 24 | + | |
24 | 25 | | |
25 | 26 | | |
26 | 27 | | |
| |||
33 | 34 | | |
34 | 35 | | |
35 | 36 | | |
| 37 | + | |
| 38 | + | |
36 | 39 | | |
37 | 40 | | |
38 | 41 | | |
| |||
0 commit comments