[SPARK-19873][SS] Record num shuffle partitions in offset log and enforce in next batch. #17216
Conversation
case class OffsetSeqMetadata(
    var batchWatermarkMs: Long = 0,
    var batchTimestampMs: Long = 0,
    var numShufflePartitions: Int = 0) {
It's better to use conf: Map[String, String] here, because we will probably add more confs to this class in the future.
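For illustration, a minimal sketch of the suggested shape, assuming the conf map is keyed by SQLConf keys (the usage snippet is illustrative, not the final API):

// Sketch: metadata stored in the offset log, with batch-specific confs kept in a
// generic map so new entries can be added later without another schema change.
case class OffsetSeqMetadata(
    batchWatermarkMs: Long = 0,
    batchTimestampMs: Long = 0,
    conf: Map[String, String] = Map.empty)

// Illustrative usage: record the shuffle partition count under its SQLConf key.
val metadata = OffsetSeqMetadata(
  batchWatermarkMs = 0,
  batchTimestampMs = System.currentTimeMillis(),
  conf = Map("spark.sql.shuffle.partitions" -> "10"))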
Hi @kunalkhamar, in case you update OffsetSeq's log version number, the work being done in #17070 might be helpful.
@zsxwing Changed to a map
@lw-lin Hi Liwei!
Thanks for letting me know. We will not be updating the log version number, since backward and forward compatibility is preserved by this patch.
Test build #74224 has finished for PR 17216 at commit
Test build #74226 has finished for PR 17216 at commit
/*
 * For backwards compatibility, if # partitions was not recorded in the offset log, then
 * ensure it is non-zero. The new value is picked up from the conf.
 */
For inline comments within the code, use // and not /* .. */.
Changed.
Test build #74290 has finished for PR 17216 at commit
Test build #74344 has finished for PR 17216 at commit
Test build #74352 has finished for PR 17216 at commit
Test build #74353 has finished for PR 17216 at commit
 */
- case class OffsetSeqMetadata(var batchWatermarkMs: Long = 0, var batchTimestampMs: Long = 0) {
+ case class OffsetSeqMetadata(
+     var batchWatermarkMs: Long = 0,
Do you know why we have these as vars? Can they be made into vals?
Changed to vals.
}

// If the number of partitions is greater, should throw exception.
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "15") {
Can you check whether the returned message is useful?
Seems okay to me. The underlying cause is a FileNotFoundException. The error message indicates: Error reading delta file /path/to/checkpoint/state/[operator]/[partition]/[batch].delta
[info] - SPARK-19873: backward compatibility - recover with wrong num shuffle partitions *** FAILED *** (12 seconds, 98 milliseconds)
[info] org.apache.spark.sql.streaming.StreamingQueryException: Query badQuery [id = dddc5e7f-1e71-454c-8362-de184444fb5a, runId = b2960c74-257a-4eb1-b242-61d13e20655f] terminated with exception: Job aborted due to stage failure: Task 10 in stage 1.0 failed 1 times, most recent failure: Lost task 10.0 in stage 1.0 (TID 11, localhost, executor driver): java.lang.IllegalStateException: Error reading delta file /Users/kunalkhamar/spark/target/tmp/spark-2816c3be-610f-450c-a821-6d0c68a12d91/state/0/10/1.delta of HDFSStateStoreProvider[id = (op=0, part=10), dir = /Users/kunalkhamar/spark/target/tmp/spark-2816c3be-610f-450c-a821-6d0c68a12d91/state/0/10]: /Users/kunalkhamar/spark/target/tmp/spark-2816c3be-610f-450c-a821-6d0c68a12d91/state/0/10/1.delta does not exist
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:384)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:336)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:333)
[info] at scala.Option.getOrElse(Option.scala:121)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:333)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:332)
[info] at scala.Option.getOrElse(Option.scala:121)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:332)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:239)
[info] at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:191)
[info] at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
[info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
[info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
[info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
[info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
[info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
[info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
[info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
[info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
[info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
[info] at org.apache.spark.scheduler.Task.run(Task.scala:108)
[info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info] at java.lang.Thread.run(Thread.java:745)
[info] Caused by: java.io.FileNotFoundException: File /Users/kunalkhamar/spark/target/tmp/spark-2816c3be-610f-450c-a821-6d0c68a12d91/state/0/10/1.delta does not exist
[info] at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539)
[info] at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752)
[info] at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529)
[info] at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
[info] at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
[info] at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
[info] at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:61)
[info] at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:381)
[info] ... 24 more
// Checkpoint data was generated by a query with 10 shuffle partitions.
// Test if recovery from checkpoint is successful.
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
  query.start().processAllAvailable()
It's not clear that this would actually re-execute a batch. Unless a batch is executed, this does not test anything. How about adding more data after processAllAvailable(), to ensure that at least one batch is actually executed?
Added.
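A hedged sketch of the pattern being asked for, assuming inputData is a MemoryStream[Int] and checkpointDir points at the pre-generated checkpoint (names follow the snippets in this thread; the helper wiring in the actual suite differs):

// Recover from the existing checkpoint, then push new data so that at least one
// fresh batch really executes against the restored state.
val query = inputData.toDF()
  .groupBy("value")
  .count()
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", checkpointDir.getAbsolutePath)
  .format("memory")
  .queryName("counts")
  .start()

query.processAllAvailable()    // recovery alone may not run a new batch
inputData.addData(5, 6, 7, 8)  // new data forces another batch to execute
query.processAllAvailable()
query.stop()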
inputData.addData(3, 4, 5, 6)
inputData.addData(5, 6, 7, 8)

val resourceUri =
Can you add a comment saying that we start the query with existing checkpoints generated by 2.1, which do not have the number of shuffle partitions recorded?
Added more comments.
QueryTest.checkAnswer(spark.table("counts").toDF(),
  Row("1", 1) :: Row("2", 1) :: Row("3", 2) :: Row("4", 2) ::
  Row("5", 2) :: Row("6", 2) :: Row("7", 1) :: Row("8", 1) :: Nil)
}
You don't seem to stop the query. It would be good to put a try .. finally within the withSQLConf to stop the query; otherwise it can lead to cascading failures.
Added try .. finally
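A small sketch of the guard being requested, assuming the suite's withSQLConf helper, a streaming DataFrame df, and checkpointDir are in scope (all assumptions here):

withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
  var query: StreamingQuery = null
  try {
    query = df.writeStream
      .outputMode("complete")
      .option("checkpointLocation", checkpointDir.getAbsolutePath)
      .format("memory")
      .queryName("counts")
      .start()
    query.processAllAvailable()
  } finally {
    // Always stop the query so one failing test cannot cascade into the next.
    if (query != null) query.stop()
  }
}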
LGTM. A few comments in tests.
val shufflePartitionsSparkSession: Int = sparkSession.conf.get(SQLConf.SHUFFLE_PARTITIONS)
offsetSeqMetadata = {
  if (nextOffsets.metadata.isEmpty) {
    OffsetSeqMetadata(0, 0,
nit: can you make this call with named params
OffsetSeqMetadata(
batchWatermarkMs = 0,
...
)
Changed.
val shufflePartitionsToUse = metadata.conf.getOrElse(SQLConf.SHUFFLE_PARTITIONS.key, {
  // For backward compatibility, if # partitions was not recorded in the offset log,
  // then ensure it is not missing. The new value is picked up from the conf.
  logDebug("Number of shuffle partitions from previous run not found in checkpoint. "
Make this a log warning so that we can debug. It should be printed only once, at the time of upgrading for the first time.
Changed to a log warning.
Rechecked the semantics; it works as expected, and the warning is only printed at the time of the first upgrade.
Once we restart a query from a v2.1 checkpoint and then stop it, any new offsets written out will contain the number of shuffle partitions. Any future restart will read these new offsets in StreamExecution.populateStartOffsets -> offsetLog.getLatest and pick up the recorded number of shuffle partitions.
Useful to note for future reference: we do not rewrite the old offset files to contain the number of shuffle partitions; the semantics are still correct because of the call to offsetLog.getLatest.
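Putting the pieces together, the recovery fallback described above looks roughly like this (a sketch based on the snippets quoted in this thread; the exact log message and surrounding code in StreamExecution may differ):

// If the offset log was written by Spark 2.1 (no confs recorded), fall back to the
// value in the current session's conf; otherwise enforce the recorded value.
val shufflePartitionsToUse = metadata.conf.getOrElse(SQLConf.SHUFFLE_PARTITIONS.key, {
  logWarning("Number of shuffle partitions from previous run not found in checkpoint. "
    + s"Using the value from the conf, $shufflePartitionsSparkSession shuffle partitions.")
  shufflePartitionsSparkSession.toString
})
sparkSessionToRunBatches.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, shufflePartitionsToUse)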
}
}
}
offsetSeqMetadata = OffsetSeqMetadata(
You can make this offsetSeqMetadata.copy(batchWatermarkMs = batchWatermarkMs, batchTimestampMs = triggerClock.getTimeMillis())
Good point, changed.
val checkpointDir = new File(resourceUri)

// 1 - Test if recovery from the checkpoint is successful.
init()
nit: init -> prepareMemoryStream
Changed.
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "15") {
  var streamingQuery: StreamingQuery = null
  try {
    intercept[StreamingQueryException] {
What is the error message?
// 2 - Check recovery with wrong num shuffle partitions
init()
withTempDir(dir => {
nit: withTempDir { dir =>
Changed.
// 1 - Test if recovery from the checkpoint is successful.
init()
withTempDir(dir => {
nit: withTempDir { dir =>
Changed.
LGTM. Just a few nits.
Test build #74561 has finished for PR 17216 at commit
Test build #74564 has finished for PR 17216 at commit
Test build #74617 has finished for PR 17216 at commit
LGTM. Will merge after tests pass.
Does this PR mix in some test files?
Test build #74708 has finished for PR 17216 at commit
zsxwing left a comment
Looks pretty good. Just some comments.
I noticed one issue about SessionState.clone: listenerManager is not cloned, so batches in a streaming query cannot be monitored by the user. Of course, it's not related to this PR. Could you fix it in a separate PR?
cd.dataType, cd.timeZoneId)
}

// Reset confs to disallow change in number of partitions
Why do we need to set the confs for every batch? You can set them after recovering offsetSeqMetadata.
Good point, changed.
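In outline, that change is just hoisting the conf update out of the per-batch loop (sketch; variable names follow the snippets quoted in this thread):

// Set once, right after offsetSeqMetadata has been recovered from the offset log,
// instead of resetting the conf before every batch.
sparkSessionToRunBatches.conf.set(
  SQLConf.SHUFFLE_PARTITIONS.key,
  offsetSeqMetadata.conf.getOrElse(
    SQLConf.SHUFFLE_PARTITIONS.key,
    sparkSessionToRunBatches.conf.get(SQLConf.SHUFFLE_PARTITIONS.key)))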
// All set
assert(OffsetSeqMetadata(1, 2, getConfWith(shufflePartitions = 3)) ===
  OffsetSeqMetadata(s"""{"batchWatermarkMs":1,"batchTimestampMs":2,"conf": {"$key":3}}"""))
nit: could you add a test to verify that unknown fields don't break the serialization? Such as
assert(OffsetSeqMetadata(1, 2, getConfWith(shufflePartitions = 3)) ===
  OffsetSeqMetadata(
    s"""{"batchWatermarkMs":1,"batchTimestampMs":2,"conf":{"$key":3},"unknown":1}"""))
Added.
…onf update occurrence to once at beginning when populating offsets.
@uncleGen Not sure what that means, could you please elaborate?
@zsxwing Will change cloning of listener manager in a new PR.
Test build #74754 has finished for PR 17216 at commit
Looks pretty good.
val sparkSessionToRunBatches = sparkSession.cloneSession()
// Adaptive execution can change num shuffle partitions, disallow
sparkSessionToRunBatches.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "false")
offsetSeqMetadata = OffsetSeqMetadata(batchWatermarkMs = 0, batchTimestampMs = 0,
nit: remove line.
Yeah, this should be kept. It should use the conf in the cloned session.
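Put together, the setup under discussion looks roughly like this (a sketch; it assumes the cloned session's conf is what seeds the metadata, per the comment above):

// Run all batches in a cloned session so that streaming-specific conf changes
// do not leak into the user's own session.
val sparkSessionToRunBatches = sparkSession.cloneSession()
// Adaptive execution can change the number of shuffle partitions mid-query,
// which would break recovery of partitioned state, so disable it here.
sparkSessionToRunBatches.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "false")
// Seed the metadata from the cloned session's conf, as suggested in the review.
offsetSeqMetadata = OffsetSeqMetadata(
  batchWatermarkMs = 0,
  batchTimestampMs = 0,
  conf = Map(SQLConf.SHUFFLE_PARTITIONS.key ->
    sparkSessionToRunBatches.conf.get(SQLConf.SHUFFLE_PARTITIONS.key)))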
Test build #74757 has finished for PR 17216 at commit
Test build #74758 has finished for PR 17216 at commit
LGTM. Merging it to master.
Test build #74760 has finished for PR 17216 at commit
Seems like an unrelated failure. Probably a flaky test.
…On Mar 17, 2017 4:23 PM, "UCB AMPLab" ***@***.***> wrote:
Merged build finished. Test FAILed.
// Checkpoint data was generated by a query with 10 shuffle partitions.
// In order to test reading from the checkpoint, the checkpoint must have two or more batches,
// since the last batch may be rerun.
What changes were proposed in this pull request?
If the user changes the shuffle partition number between batches, streaming aggregation will fail.
Here are some possible cases:
How was this patch tested?
OffsetSeqMetadata json with Spark v2.1.0
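To make the failure mode concrete, a hedged sketch of the scenario (inputData is an assumed MemoryStream[Int]; the checkpoint path is illustrative):

// Run 1: a stateful aggregation checkpointed while spark.sql.shuffle.partitions = 10.
spark.conf.set("spark.sql.shuffle.partitions", "10")
val firstRun = inputData.toDF().groupBy("value").count()
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/agg-checkpoint")  // illustrative path
  .format("memory")
  .queryName("counts")
  .start()
firstRun.processAllAvailable()
firstRun.stop()

// Run 2: restart from the same checkpoint with 15 partitions. Before this patch the
// state store looks for delta files for partitions 10-14 that were never written,
// and recovery fails with "Error reading delta file ... does not exist".
spark.conf.set("spark.sql.shuffle.partitions", "15")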