Conversation

@xuanyuanking
Member

What changes were proposed in this pull request?

This problem was reported by @yanlin-Lynn, @ivoson, and @LiangchangZ. Thanks!

When we union two streams from Kafka or other sources, and one of them has no continuous incoming data while a task restart happens, an IllegalStateException is thrown. The root cause is the code in MicroBatchExecution: when a stream has no continuous data, its committedOffset equals its availableOffset during populateStartOffsets, and currentPartitionOffsets is not properly handled in KafkaSource. We should probably also consider this scenario in other Source implementations.

How was this patch tested?

Added a unit test in KafkaSourceSuite.scala.
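
To make the failure concrete, here is a minimal sketch (an illustration of the idea, not the verbatim patch) of how KafkaSource.getBatch can seed currentPartitionOffsets on recovery, when getBatch is called before getOffset:

    // Sketch only; the method body is elided to the relevant part.
    override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
      val untilPartitionOffsets = KafkaSourceOffset.getPartitionOffsets(end)
      // On recovery, getBatch is called before getOffset, so for a source
      // that had no new data currentPartitionOffsets may still be None.
      // Seed it from the replayed batch's end offsets; otherwise the next
      // cycle compares against stale state and reports bogus data loss.
      if (currentPartitionOffsets.isEmpty) {
        currentPartitionOffsets = Some(untilPartitionOffsets)
      }
      // ... existing logic: compute offset ranges and build the DataFrame ...
    }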

@SparkQA

SparkQA commented Jan 4, 2018

Test build #85675 has finished for PR 20150 at commit aa3d7b7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 4, 2018

Test build #85678 has finished for PR 20150 at commit fa64187.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

cc @zsxwing

@xuanyuanking
Member Author

cc @gatorsmile @cloud-fan

@zsxwing
Member

zsxwing commented Jan 10, 2018

@xuanyuanking could you post the full stack trace for this issue?

@xuanyuanking
Member Author

Hi Shixiong, thanks a lot for your reply.
The full stack trace below can be reproduced by running the added UT on the original code base.

Assert on query failed: Query [id = 3421db21-652e-47af-9d54-2b74a222abed, runId = cd8d7c94-1286-44a5-b000-a8d870aef6fa] terminated with exception: Partition topic-0-0's offset was changed from 10 to 5, some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".

org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)

Caused by: Partition topic-0-0's offset was changed from 10 to 5, some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".

org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:332)
org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:291)
org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:289)
scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
scala.collection.AbstractTraversable.filter(Traversable.scala:104)
org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:289)
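
For reference, the filter at the bottom of the trace (KafkaSource.scala:289-291) is where the data-loss report fires. Paraphrased (a sketch, not the verbatim source), the check looks roughly like:

    // Ranges whose untilOffset is behind fromOffset are treated as data loss.
    // With stale currentPartitionOffsets after a restart, fromOffset can be
    // wrong, producing the "changed from 10 to 5" message above.
    val validRanges = offsetRanges.filter { range =>
      if (range.untilOffset < range.fromOffset) {
        reportDataLoss(
          s"Partition ${range.topicPartition}'s offset was changed from " +
          s"${range.fromOffset} to ${range.untilOffset}, some data may have been missed")
        false
      } else {
        true
      }
    }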

@zsxwing zsxwing left a comment

Thanks for fixing this. Looks good to me. Just some nits.

batches.slice(sliceStart, sliceEnd)
}

if (newBlocks.isEmpty) {
@zsxwing
Member

nit: could you add an assert(sliceStart <= sliceEnd, s"sliceStart: $sliceStart sliceEnd: $sliceEnd") above batches.slice(sliceStart, sliceEnd) to make sure getBatch will not be called with wrong offsets.
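
For context, the suggested guard would sit directly above the slice, e.g.:

    // Fail fast with a descriptive message if getBatch ever computes an
    // inverted slice range.
    assert(sliceStart <= sliceEnd, s"sliceStart: $sliceStart sliceEnd: $sliceEnd")
    batches.slice(sliceStart, sliceEnd)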

@xuanyuanking
Member Author

DONE

)
}

test("union bug in failover") {
@zsxwing
Member

nit: test("SPARK-22956: currentPartitionOffsets should be set when no new data comes in")

@xuanyuanking
Member Author

DONE

StartStream(ProcessingTime(100), clock),
waitUntilBatchProcessed,
// 5 from smaller topic, 5 from bigger one
CheckAnswer(0, 1, 2, 3, 4, 100, 101, 102, 103, 104),
@zsxwing
Member

You can clean this code up a bit using the following snippet:

    testStream(kafka)(
      StartStream(ProcessingTime(100), clock),
      waitUntilBatchProcessed,
      // 5 from smaller topic, 5 from bigger one
      CheckLastBatch((0 to 4) ++ (100 to 104): _*),
      AdvanceManualClock(100),
      waitUntilBatchProcessed,
      // 5 from smaller topic, 5 from bigger one
      CheckLastBatch((5 to 9) ++ (105 to 109): _*),
      AdvanceManualClock(100),
      waitUntilBatchProcessed,
      // smaller topic empty, 5 from bigger one
      CheckLastBatch(110 to 114: _*),
      StopStream,
      StartStream(ProcessingTime(100), clock),
      waitUntilBatchProcessed,
      // smallest now empty, 5 from bigger one
      CheckLastBatch(115 to 119: _*),
      AdvanceManualClock(100),
      waitUntilBatchProcessed,
      // smallest now empty, 5 from bigger one
      CheckLastBatch(120 to 124: _*)
    )

@xuanyuanking
Member Author

Cool, this makes the code much cleaner.

@SparkQA

SparkQA commented Jan 15, 2018

Test build #86128 has finished for PR 20150 at commit bf8af29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Jan 16, 2018

Thanks! Merging to master and 2.3.

@xuanyuanking
Member Author

Thanks for your review, Shixiong!

asfgit pushed a commit that referenced this pull request Jan 16, 2018

Author: Yuanjian Li <[email protected]>

Closes #20150 from xuanyuanking/SPARK-22956.

(cherry picked from commit 07ae39d)
Signed-off-by: Shixiong Zhu <[email protected]>
@asfgit asfgit closed this in 07ae39d Jan 16, 2018
@xuanyuanking xuanyuanking deleted the SPARK-22956 branch January 16, 2018 08:22