
Conversation

@kunalkhamar kunalkhamar commented Mar 8, 2017

What changes were proposed in this pull request?

If the user changes the shuffle partition number between batches, streaming aggregation will fail, because the aggregation state is stored per shuffle partition.

Here are some possible cases (a minimal sketch of the restart scenario follows this list):

  • Change "spark.sql.shuffle.partitions"
  • Use "repartition" and change the partition number in codes
  • RangePartitioner doesn't generate deterministic partitions. Right now it's safe as we disallow sort before aggregation. Not sure if we will add some operators using RangePartitioner in future.
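Below is a minimal sketch (not code from this patch) of the first case: run a streaming aggregation with one value of spark.sql.shuffle.partitions, stop it, then restart from the same checkpoint with a different value. The session setup, checkpoint path, and query name are hypothetical; MemoryStream is the internal test source that this PR's own tests use.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.streaming.MemoryStream

    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-partition-change-sketch")
      .getOrCreate()
    import spark.implicits._
    implicit val sqlContext = spark.sqlContext

    spark.conf.set("spark.sql.shuffle.partitions", "10")   // first run writes state for 10 partitions

    val input = MemoryStream[Int]
    val counts = input.toDS().groupBy($"value").count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("memory")
      .queryName("counts")
      .option("checkpointLocation", "/tmp/agg-checkpoint")  // hypothetical path
      .start()

    input.addData(1, 2, 3)
    query.processAllAvailable()
    query.stop()

    // Restarting against the same checkpoint with spark.sql.shuffle.partitions = 15 fails
    // before this patch: partitions 10..14 have no state delta files, so recovery hits
    // "Error reading delta file .../state/0/<partition>/1.delta".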

How was this patch tested?

  • Unit tests
  • Manual tests
    • forward compatibility tested by using the new OffsetSeqMetadata json with Spark v2.1.0

@kunalkhamar kunalkhamar changed the title [SPARK-19873][SS] Record num shuffle partitions in offset log and enforce in next batch. [SPARK-19873][SS][WIP] Record num shuffle partitions in offset log and enforce in next batch. Mar 8, 2017
@kunalkhamar kunalkhamar changed the title [SPARK-19873][SS][WIP] Record num shuffle partitions in offset log and enforce in next batch. [SPARK-19873][SS] Record num shuffle partitions in offset log and enforce in next batch. Mar 8, 2017
case class OffsetSeqMetadata(
    var batchWatermarkMs: Long = 0,
    var batchTimestampMs: Long = 0,
    var numShufflePartitions: Int = 0) {
Member

It's better to use conf: Map[String, String] here because we will probably add more confs to this class in the future.

Contributor

Hi @kunalkhamar, in case you update OffsetSeq's log version number, the work being done in #17070 might be helpful.

Contributor Author

@zsxwing Changed to a map
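For reference, a sketch of what the metadata class looks like after the change to a conf map, based on the fields visible in this discussion (the json helpers shown here are illustrative, not the literal patch):

    import org.json4s.NoTypeHints
    import org.json4s.jackson.Serialization

    case class OffsetSeqMetadata(
        batchWatermarkMs: Long = 0,
        batchTimestampMs: Long = 0,
        conf: Map[String, String] = Map.empty) {
      def json: String = Serialization.write(this)(Serialization.formats(NoTypeHints))
    }

    object OffsetSeqMetadata {
      private implicit val format = Serialization.formats(NoTypeHints)
      // Parse metadata back from the JSON stored in the offset log.
      def apply(json: String): OffsetSeqMetadata = Serialization.read[OffsetSeqMetadata](json)
    }

Storing a string-to-string map keeps the offset-log format open-ended: new per-batch confs can be added later without another schema change.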

Contributor Author

@lw-lin Hi Liwei!
Thanks for letting me know. We will not be updating the log version number, since backward and forward compatibility is preserved by this patch.


SparkQA commented Mar 9, 2017

Test build #74224 has finished for PR 17216 at commit 12f5fd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class OffsetSeqMetadata(


SparkQA commented Mar 9, 2017

Test build #74226 has finished for PR 17216 at commit 9ff4d29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/*
 * For backwards compatibility, if # partitions was not recorded in the offset log, then
 * ensure it is non-zero. The new value is picked up from the conf.
 */
Contributor

For inline comments within the code, use // and not /* .. */.

Contributor Author

Changed.

@kunalkhamar
Contributor Author

@zsxwing @uncleGen @lw-lin
This is ready for another review, can you please take a look when you get a chance?


SparkQA commented Mar 10, 2017

Test build #74290 has finished for PR 17216 at commit 60ec7da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 10, 2017

Test build #74344 has finished for PR 17216 at commit f6bd071.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 11, 2017

Test build #74352 has finished for PR 17216 at commit 3af1cb4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 11, 2017

Test build #74353 has finished for PR 17216 at commit 1cacd32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 */
-case class OffsetSeqMetadata(var batchWatermarkMs: Long = 0, var batchTimestampMs: Long = 0) {
+case class OffsetSeqMetadata(
+    var batchWatermarkMs: Long = 0,
Contributor

@tdas tdas Mar 13, 2017

Do you know why we have these as vars? Can they be made into vals?

Contributor Author

Changed to vals.

}

// If the number of partitions is greater, it should throw an exception.
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "15") {
Contributor

can you check whether the returned message is useful?

Contributor Author

@kunalkhamar kunalkhamar Mar 14, 2017

Seems okay to me. The underlying cause is FileNotFoundException. The error message indicates: Error reading delta file /path/to/checkpoint/state/[operator]/[partition]/[batch].delta

[info] - SPARK-19873: backward compatibility - recover with wrong num shuffle partitions *** FAILED *** (12 seconds, 98 milliseconds)
[info] org.apache.spark.sql.streaming.StreamingQueryException: Query badQuery [id = dddc5e7f-1e71-454c-8362-de184444fb5a, runId = b2960c74-257a-4eb1-b242-61d13e20655f] terminated with exception: Job aborted due to stage failure: Task 10 in stage 1.0 failed 1 times, most recent failure: Lost task 10.0 in stage 1.0 (TID 11, localhost, executor driver): java.lang.IllegalStateException: Error reading delta file /Users/kunalkhamar/spark/target/tmp/spark-2816c3be-610f-450c-a821-6d0c68a12d91/state/0/10/1.delta of HDFSStateStoreProvider[id = (op=0, part=10), dir = /Users/kunalkhamar/spark/target/tmp/spark-2816c3be-610f-450c-a821-6d0c68a12d91/state/0/10]: /Users/kunalkhamar/spark/target/tmp/spark-2816c3be-610f-450c-a821-6d0c68a12d91/state/0/10/1.delta does not exist
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:384)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:336)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:333)
[info] at scala.Option.getOrElse(Option.scala:121)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:333)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:332)
[info] at scala.Option.getOrElse(Option.scala:121)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:332)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:239)
[info] at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:191)
[info] at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
[info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
[info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
[info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
[info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
[info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
[info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
[info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
[info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
[info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
[info] at org.apache.spark.scheduler.Task.run(Task.scala:108)
[info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info] at java.lang.Thread.run(Thread.java:745)
[info] Caused by: java.io.FileNotFoundException: File /Users/kunalkhamar/spark/target/tmp/spark-2816c3be-610f-450c-a821-6d0c68a12d91/state/0/10/1.delta does not exist
[info] at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539)
[info] at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752)
[info] at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529)
[info] at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
[info] at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
[info] at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
[info] at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:61)
[info] at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
[info] at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:381)
[info] ... 24 more

// Checkpoint data was generated by a query with 10 shuffle partitions.
// Test if recovery from checkpoint is successful.
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
  query.start().processAllAvailable()
Contributor

It's not clear that this would actually re-execute a batch. Unless a batch is executed, this does not test anything. So how about adding more data after processAllAvailable(), to ensure that at least one batch is actually executed?

Contributor Author

Added.
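The resulting pattern, roughly (a hypothetical sketch using names from the surrounding test excerpts, not the literal diff):

    val restarted = query.start()       // recover from the existing checkpoint
    restarted.processAllAvailable()     // re-runs whatever the checkpoint left uncommitted
    inputData.addData(3, 4, 5, 6)       // fresh rows force at least one genuinely new batch
    restarted.processAllAvailable()     // the new batch must read the recovered state
    restarted.stop()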

inputData.addData(3, 4, 5, 6)
inputData.addData(5, 6, 7, 8)

val resourceUri =
Contributor

Can you add a comment saying that this starts the query with existing checkpoints generated by 2.1, which do not have shuffle partitions recorded?

Contributor Author

Added more comments.

QueryTest.checkAnswer(spark.table("counts").toDF(),
Row("1", 1) :: Row("2", 1) :: Row("3", 2) :: Row("4", 2) ::
Row("5", 2) :: Row("6", 2) :: Row("7", 1) :: Row("8", 1) :: Nil)
}
Contributor

You don't seem to stop the query. It would be good to put a try .. finally within the withSQLConf to stop the query; otherwise this can lead to cascaded failures.

Contributor Author

Added try .. finally
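Roughly the shape of that change (a hypothetical sketch mirroring the excerpt shown further down in this review):

    withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
      var streamingQuery: StreamingQuery = null
      try {
        streamingQuery = query.start()
        streamingQuery.processAllAvailable()
        // ... assertions against the "counts" table ...
      } finally {
        if (streamingQuery != null) streamingQuery.stop()
      }
    }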


tdas commented Mar 13, 2017

LGTM. A few comments in tests.

val shufflePartitionsSparkSession: Int = sparkSession.conf.get(SQLConf.SHUFFLE_PARTITIONS)
offsetSeqMetadata = {
  if (nextOffsets.metadata.isEmpty) {
    OffsetSeqMetadata(0, 0,
Contributor

nit: can you make this call with named params

OffsetSeqMetadata(
  batchWatermarkMs = 0,  
  ...
)

Contributor Author

Changed.
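For illustration, the call written out with named parameters, using the conf-map form of OffsetSeqMetadata discussed earlier (a sketch, not the literal diff):

    OffsetSeqMetadata(
      batchWatermarkMs = 0,
      batchTimestampMs = 0,
      conf = Map(SQLConf.SHUFFLE_PARTITIONS.key -> shufflePartitionsSparkSession.toString))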

val shufflePartitionsToUse = metadata.conf.getOrElse(SQLConf.SHUFFLE_PARTITIONS.key, {
  // For backward compatibility, if # partitions was not recorded in the offset log,
  // then ensure it is not missing. The new value is picked up from the conf.
  logDebug("Number of shuffle partitions from previous run not found in checkpoint. "
Contributor

Make this a log warning so that we can debug, and it should be printed only once, at the time of upgrading for the first time.

Contributor Author

Changed to a log warning.
Rechecked the semantics; it works as expected, and the warning is only printed at the time of first upgrade.
Once we restart a query from a v2.1 checkpoint and then stop it, any new offsets written out will contain the number of shuffle partitions. Any future restarts will read these new offsets in StreamExecution.populateStartOffsets -> offsetLog.getLatest and pick up the recorded number of shuffle partitions.
It is useful to note for future reference that we do not change the old offset files to contain the number of shuffle partitions; the semantics are correct because of the call to offsetLog.getLatest.
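A sketch of the recovery path described above (names follow the surrounding excerpt; this is not the exact patch):

    val shufflePartitionsToUse = metadata.conf.getOrElse(SQLConf.SHUFFLE_PARTITIONS.key, {
      // Pre-2.2 checkpoint: the value is missing from the offset log, so fall back to the
      // current conf and warn once, at the time of the first upgrade.
      logWarning("Number of shuffle partitions from previous run not found in checkpoint. " +
        s"Using the value from the conf: $shufflePartitionsSparkSession.")
      shufflePartitionsSparkSession.toString   // conf values are stored as strings
    })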

}
}
}
offsetSeqMetadata = OffsetSeqMetadata(
Contributor

You can make this offsetSeqMetadata.copy(batchWatermarkMs = batchWatermarkMs, batchTimestampMs = triggerClock.getTimeMillis())

Contributor Author

Good point, changed.

val checkpointDir = new File(resourceUri)

// 1 - Test if recovery from the checkpoint is successful.
init()
Contributor

nit: init -> prepareMemoryStream

Contributor Author

Changed.

withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "15") {
var streamingQuery: StreamingQuery = null
try {
intercept[StreamingQueryException] {
Contributor

what is the error message?

Contributor Author


// 2 - Check recovery with wrong num shuffle partitions
init()
withTempDir(dir => {
Contributor

nit:
withTempDir { dir =>

Contributor Author

Changed.


// 1 - Test if recovery from the checkpoint is successful.
init()
withTempDir(dir => {
Contributor

nit: withTempDir { dir =>

Contributor Author

Changed.


tdas commented Mar 15, 2017

LGTM. Just a few nits.


SparkQA commented Mar 15, 2017

Test build #74561 has finished for PR 17216 at commit 030e635.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 15, 2017

Test build #74564 has finished for PR 17216 at commit 5c851a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 15, 2017

Test build #74617 has finished for PR 17216 at commit dfae7be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


tdas commented Mar 17, 2017

LGTM. Will merge after tests pass.


uncleGen commented Mar 17, 2017

Does this PR mix in some test files?


SparkQA commented Mar 17, 2017

Test build #74708 has finished for PR 17216 at commit 4733b4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing zsxwing left a comment


Looks pretty good. Just some comments.

I noticed one issue about SessionState.clone: listenerManager is not cloned, so batches in a streaming query cannot be monitored by the user. Of course, it's not related to this PR. Could you fix it in a separate PR?

cd.dataType, cd.timeZoneId)
}

// Reset confs to disallow change in number of partitions
Member

Why do we need to set the confs for every batch? You can set them after recovering offsetSeqMetadata.

Contributor Author

Good point, changed.
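A sketch of that simplification (names follow the excerpts above, not the literal diff): recover the metadata once while populating start offsets, then pin the conf in the session that runs the batches.

    offsetSeqMetadata = nextOffsets.metadata.getOrElse(
      OffsetSeqMetadata(batchWatermarkMs = 0, batchTimestampMs = 0))
    val shufflePartitionsToUse = offsetSeqMetadata.conf.getOrElse(
      SQLConf.SHUFFLE_PARTITIONS.key, shufflePartitionsSparkSession.toString)
    // Done once here, instead of being reset before every batch.
    sparkSessionToRunBatches.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, shufflePartitionsToUse)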


// All set
assert(OffsetSeqMetadata(1, 2, getConfWith(shufflePartitions = 3)) ===
OffsetSeqMetadata(s"""{"batchWatermarkMs":1,"batchTimestampMs":2,"conf": {"$key":3}}"""))
Member

nit: could you add a test to verify that unknown fields don't break the serialization? Such as

    assert(OffsetSeqMetadata(1, 2, getConfWith(shufflePartitions = 3)) ===
      OffsetSeqMetadata(
        s"""{"batchWatermarkMs":1,"batchTimestampMs":2,"conf": {"$key":3},"unknown":1}"""))

Contributor Author

Added.

…onf update occurrence to once at beginning when populating offsets.
@kunalkhamar
Contributor Author

@uncleGen Not sure what that means, could you please elaborate?

@kunalkhamar
Contributor Author

@zsxwing Will change cloning of listener manager in a new PR.


SparkQA commented Mar 17, 2017

Test build #74754 has finished for PR 17216 at commit 3ae4414.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing zsxwing left a comment


Looks pretty good.

val sparkSessionToRunBatches = sparkSession.cloneSession()
// Adaptive execution can change num shuffle partitions, disallow
sparkSessionToRunBatches.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "false")
offsetSeqMetadata = OffsetSeqMetadata(batchWatermarkMs = 0, batchTimestampMs = 0,
Member

nit: remove line.

Member

Yeah, this should be kept. It should use the conf in the cloned session.
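In other words (a hypothetical sketch), the initial metadata should record the shuffle-partition setting of the cloned session that will actually run the batches:

    offsetSeqMetadata = OffsetSeqMetadata(
      batchWatermarkMs = 0,
      batchTimestampMs = 0,
      conf = Map(SQLConf.SHUFFLE_PARTITIONS.key ->
        sparkSessionToRunBatches.conf.get(SQLConf.SHUFFLE_PARTITIONS.key)))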


SparkQA commented Mar 17, 2017

Test build #74757 has finished for PR 17216 at commit a2b32ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 17, 2017

Test build #74758 has finished for PR 17216 at commit 3abe0a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


tdas commented Mar 17, 2017

LGTM. Merging it to master.

@asfgit asfgit closed this in 3783539 Mar 17, 2017

SparkQA commented Mar 17, 2017

Test build #74760 has finished for PR 17216 at commit a0c71af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


tdas commented Mar 18, 2017 via email


// Checkpoint data was generated by a query with 10 shuffle partitions.
// In order to test reading from the checkpoint, the checkpoint must have two or more batches,
// since the last batch may be rerun.
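For context, a hypothetical sketch of how a checkpoint with two committed batches can be produced before stopping the query (aggregated, inputData, and checkpointDir are illustrative names):

    val query = aggregated.writeStream
      .outputMode("complete")
      .format("memory")
      .queryName("counts")
      .option("checkpointLocation", checkpointDir.getAbsolutePath)
      .start()
    inputData.addData(1, 2, 3, 4)   // batch 0
    query.processAllAvailable()
    inputData.addData(3, 4, 5, 6)   // batch 1
    query.processAllAvailable()
    query.stop()                    // recovery may re-run batch 1, but batch 0 is fully committed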