
Conversation

@zsxwing
Member

@zsxwing zsxwing commented Aug 23, 2018

What changes were proposed in this pull request?

When there are missing offsets, the Kafka v2 source may return duplicated records when `failOnDataLoss=false` because it doesn't skip the missing offsets.

This PR fixes the issue and also adds regression tests for all Kafka readers.
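To illustrate the failure mode, here is a simplified sketch (hypothetical `Record` and `fetch` helpers, not the actual KafkaDataConsumer change): when log retention has deleted some offsets, a fetch for offset `n` can return a record at an offset greater than `n`. If the reader only advances its cursor by one instead of jumping past the gap, it requests an overlapping range again and emits duplicates.

```scala
// Simplified sketch only; `Record` and `fetch` stand in for the real consumer API.
case class Record(offset: Long, value: Array[Byte])
def fetch(offset: Long): Record = ??? // returns the first record at or after `offset`

def read(fromOffset: Long, untilOffset: Long, failOnDataLoss: Boolean): Seq[Record] = {
  val out = scala.collection.mutable.ArrayBuffer[Record]()
  var nextOffset = fromOffset
  while (nextOffset < untilOffset) {
    val record = fetch(nextOffset)
    if (record.offset > nextOffset && failOnDataLoss) {
      throw new IllegalStateException(
        s"Offsets [$nextOffset, ${record.offset}) were deleted; some data may have been lost")
    }
    out += record
    // Advance past the gap: using `nextOffset + 1` here would re-request offsets
    // that were already covered and can produce duplicated records.
    nextOffset = record.offset + 1
  }
  out.toSeq
}
```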

How was this patch tested?

New tests.

offsetRanges.zipWithIndex.map { case (o, i) => new KafkaSourceRDDPartition(i, o) }.toArray
}

override def count(): Long = offsetRanges.map(_.size).sum
Member Author

These methods are never used, as Dataset always uses this RDD:

and MapPartitionsRDD just calls the default RDD implementations. In addition, they may return wrong answers when `failOnDataLoss=false`, so I just removed them.
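For example (hypothetical numbers, just to show why the precomputed count could be wrong): a count derived from offset ranges assumes every offset in the range still exists.

```scala
// Illustrative only: a nominal count from offset ranges ignores offsets
// that log retention may already have deleted.
case class OffsetRange(fromOffset: Long, untilOffset: Long) {
  def size: Long = untilOffset - fromOffset
}

val offsetRanges = Seq(OffsetRange(0, 50), OffsetRange(50, 100))
val nominalCount = offsetRanges.map(_.size).sum // 100
// If retention removed offsets 0-29, only 70 records can actually be read,
// so a count() based on the ranges would over-report with failOnDataLoss=false.
```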

}
}

class KafkaSourceStressForDontFailOnDataLossSuite extends StreamTest with SharedSQLContext {
Member Author

Moved to KafkaDontFailOnDataLossSuite.scala

}
}

class KafkaSourceStressForDontFailOnDataLossSuite extends StreamTest with KafkaMissingOffsetsTest {
Member Author

Copied from KafkaMicroBatchSourceSuite.scala. I also moved the setup code to KafkaMissingOffsetsTest to share it with KafkaDontFailOnDataLossSuite.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95177 has finished for PR 22207 at commit f2d4d67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95179 has finished for PR 22207 at commit e968159.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val result = spark.table(table).as[String].collect().toList
assert(result.distinct.size === result.size, s"$result contains duplicated records")
// Make sure Kafka did remove some records so that this test is valid.
assert(result.size > 0 && result.size < 50)
Contributor

How do you ensure that the retention policy configured above will not delete all the records?

Member Author

@zsxwing zsxwing Aug 24, 2018

I checked the Kafka code; it will keep at least one segment for a topic. I also did a simple test to make sure it will not delete all records: I added a `Thread.sleep(120000)` after the check that the earliest offset has advanced (reproduced below), and the assertion still passed.
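The check referenced above, roughly as described in the comment (reconstructed from it, not the exact committed test code):

```scala
// Wait for retention to delete records from the head of the topic...
eventually(timeout(60.seconds)) {
  assert(
    testUtils.getEarliestOffsets(Set(topic)).head._2 > 0,
    "Kafka didn't delete records after 1 minute")
}
// ...then wait two more minutes; Kafka still kept at least one segment,
// so the topic was never emptied completely.
Thread.sleep(120000)
```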

Contributor

@tdas tdas left a comment

Looks good. Thanks for finding this bug. Just a few nits in my comments.

} else {
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", testUtils.brokerAddress)
Contributor

Dedup these options into a map... just to make sure they are never inconsistent.
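One possible way to apply this suggestion (names are illustrative, not the committed change): keep the shared Kafka options in a single map so the streaming and batch readers cannot drift apart.

```scala
// Shared Kafka options used by both the streaming and the batch reader.
val kafkaOptions = Map(
  "kafka.bootstrap.servers" -> testUtils.brokerAddress,
  "subscribe" -> topic)

val streamingDf = spark.readStream.format("kafka").options(kafkaOptions).load()
val batchDf = spark.read.format("kafka").options(kafkaOptions).load()
```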

.start()
try {
eventually(timeout(60.seconds)) {
assert(spark.table(table).as[String].collect().contains("49"))
Contributor

Doesn't processAllAvailable work in continuous processing?

Member Author

I didn't know it worked!
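For reference, `processAllAvailable()` does remove the need for polling in the micro-batch case (a sketch below, assuming the streaming DataFrame `df` and memory sink named `table` from the test above); as the follow-up PR #22230 later notes, it does not work for continuous processing, where the `eventually` loop is still needed.

```scala
val query = df.writeStream.format("memory").queryName(table).start()
// Blocks until all data available at this point has been processed (micro-batch only).
query.processAllAvailable()
assert(spark.table(table).as[String].collect().contains("49"))
```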

protected def startStream(ds: Dataset[Int]) = {
ds.writeStream.foreach(new ForeachWriter[Int] {

override def open(partitionId: Long, version: Long): Boolean = {
Contributor

nit: make single line.

Thread.sleep(Random.nextInt(500))
}

override def close(errorOrNull: Throwable): Unit = {
Contributor

nit: make single line.
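A sketch of what the two nits suggest (illustrative, not the exact committed diff), assuming the surrounding `ds: Dataset[Int]` from the test: keep the trivial `open` and `close` overrides on a single line each.

```scala
import scala.util.Random
import org.apache.spark.sql.ForeachWriter

ds.writeStream.foreach(new ForeachWriter[Int] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: Int): Unit = {
    // Sleep a random interval so records can be aged out while the query runs.
    Thread.sleep(Random.nextInt(500))
  }

  override def close(errorOrNull: Throwable): Unit = {}
}).start()
```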

@SparkQA

SparkQA commented Aug 24, 2018

Test build #95224 has finished for PR 22207 at commit 3515275.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member Author

zsxwing commented Aug 24, 2018

Thanks! Merging to master and 2.3.

@asfgit asfgit closed this in 8bb9414 Aug 24, 2018
@zsxwing zsxwing deleted the SPARK-25214 branch August 24, 2018 20:57
@zsxwing
Member Author

zsxwing commented Aug 24, 2018

I just realized the Kafka source v2 is not in 2.3 :)

asfgit pushed a commit that referenced this pull request Aug 25, 2018
…turn duplicated records when `failOnDataLoss=false`

## What changes were proposed in this pull request?

This is a follow up PR for #22207 to fix a potential flaky test. `processAllAvailable` doesn't work for continuous processing so we should not use it for a continuous query.

## How was this patch tested?

Jenkins.

Closes #22230 from zsxwing/SPARK-25214-2.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
bogdanrdc pushed a commit to bogdanrdc/spark that referenced this pull request Aug 28, 2018
…cated records when `failOnDataLoss=false`

(Commit message is identical to the pull request description above.)

Closes apache#22207 from zsxwing/SPARK-25214.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
bogdanrdc pushed a commit to bogdanrdc/spark that referenced this pull request Aug 28, 2018
…turn duplicated records when `failOnDataLoss=false`

(Commit message is identical to the follow-up description above.)

Closes apache#22230 from zsxwing/SPARK-25214-2.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>