
Conversation

@zsxwing
Member

@zsxwing zsxwing commented Aug 23, 2018

What changes were proposed in this pull request?

When there are missing offsets, the Kafka v2 source may return duplicated records when `failOnDataLoss=false` because it doesn't skip the missing offsets.

This PR fixes the issue and also adds regression tests for all Kafka readers.
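To illustrate the failure mode, here is a simplified sketch (hypothetical `Record` and `fetch` helpers, not the actual KafkaDataConsumer change): when log retention has deleted some offsets, a fetch for offset `n` can return a record at an offset greater than `n`. If the reader only advances its cursor by one instead of jumping past the gap, it requests an overlapping range again and emits duplicates.

```scala
// Simplified sketch only; `Record` and `fetch` stand in for the real consumer API.
case class Record(offset: Long, value: Array[Byte])
def fetch(offset: Long): Record = ??? // returns the first record at or after `offset`

def read(fromOffset: Long, untilOffset: Long, failOnDataLoss: Boolean): Seq[Record] = {
  val out = scala.collection.mutable.ArrayBuffer[Record]()
  var nextOffset = fromOffset
  while (nextOffset < untilOffset) {
    val record = fetch(nextOffset)
    if (record.offset > nextOffset && failOnDataLoss) {
      throw new IllegalStateException(
        s"Offsets [$nextOffset, ${record.offset}) were deleted; some data may have been lost")
    }
    out += record
    // Advance past the gap: using `nextOffset + 1` here would re-request offsets
    // that were already covered and can produce duplicated records.
    nextOffset = record.offset + 1
  }
  out.toSeq
}
```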

How was this patch tested?

New tests.

offsetRanges.zipWithIndex.map { case (o, i) => new KafkaSourceRDDPartition(i, o) }.toArray
}

override def count(): Long = offsetRanges.map(_.size).sum
Member Author

These methods are never used, as Dataset always uses this RDD:

and MapPartitionsRDD just calls the default RDD implementations. In addition, they may return wrong answers when `failOnDataLoss=false`, so I just removed them.
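For example (hypothetical numbers, just to show why the precomputed count could be wrong): a count derived from offset ranges assumes every offset in the range still exists.

```scala
// Illustrative only: a nominal count from offset ranges ignores offsets
// that log retention may already have deleted.
case class OffsetRange(fromOffset: Long, untilOffset: Long) {
  def size: Long = untilOffset - fromOffset
}

val offsetRanges = Seq(OffsetRange(0, 50), OffsetRange(50, 100))
val nominalCount = offsetRanges.map(_.size).sum // 100
// If retention removed offsets 0-29, only 70 records can actually be read,
// so a count() based on the ranges would over-report with failOnDataLoss=false.
```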

}
}

class KafkaSourceStressForDontFailOnDataLossSuite extends StreamTest with SharedSQLContext {
Member Author

Moved to KafkaDontFailOnDataLossSuite.scala

}
}

class KafkaSourceStressForDontFailOnDataLossSuite extends StreamTest with KafkaMissingOffsetsTest {
Member Author

Copied from KafkaMicroBatchSourceSuite.scala. I also moved the setup code to KafkaMissingOffsetsTest to share it with KafkaDontFailOnDataLossSuite.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95177 has finished for PR 22207 at commit f2d4d67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95179 has finished for PR 22207 at commit e968159.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val result = spark.table(table).as[String].collect().toList
assert(result.distinct.size === result.size, s"$result contains duplicated records")
// Make sure Kafka did remove some records so that this test is valid.
assert(result.size > 0 && result.size < 50)
Contributor

How do you ensure that the retention policy configured above will not delete all the records?

Member Author

@zsxwing zsxwing Aug 24, 2018

I checked the Kafka code; it will keep at least one segment for a topic. I also did a simple test to make sure it will not delete all records: I added a `Thread.sleep(120000)` after the check that the earliest offset has advanced (reproduced below), and the assertion still passed.
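The check referenced above, roughly as described in the comment (reconstructed from it, not the exact committed test code):

```scala
// Wait for retention to delete records from the head of the topic...
eventually(timeout(60.seconds)) {
  assert(
    testUtils.getEarliestOffsets(Set(topic)).head._2 > 0,
    "Kafka didn't delete records after 1 minute")
}
// ...then wait two more minutes; Kafka still kept at least one segment,
// so the topic was never emptied completely.
Thread.sleep(120000)
```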

Contributor

@tdas tdas left a comment

Looks good. Thanks for finding this bug. Just a few nits in my comments.

} else {
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", testUtils.brokerAddress)
Contributor

Dedup these options into a map... just to make sure they are never inconsistent.
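One possible way to apply this suggestion (names are illustrative, not the committed change): keep the shared Kafka options in a single map so the streaming and batch readers cannot drift apart.

```scala
// Shared Kafka options used by both the streaming and the batch reader.
val kafkaOptions = Map(
  "kafka.bootstrap.servers" -> testUtils.brokerAddress,
  "subscribe" -> topic)

val streamingDf = spark.readStream.format("kafka").options(kafkaOptions).load()
val batchDf = spark.read.format("kafka").options(kafkaOptions).load()
```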

.start()
try {
eventually(timeout(60.seconds)) {
assert(spark.table(table).as[String].collect().contains("49"))
Contributor

Doesn't processAllAvailable work in continuous processing?

Member Author

I didn't know it worked!
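For reference, `processAllAvailable()` does remove the need for polling in the micro-batch case (a sketch below, assuming the streaming DataFrame `df` and memory sink named `table` from the test above); as the follow-up PR #22230 later notes, it does not work for continuous processing, where the `eventually` loop is still needed.

```scala
val query = df.writeStream.format("memory").queryName(table).start()
// Blocks until all data available at this point has been processed (micro-batch only).
query.processAllAvailable()
assert(spark.table(table).as[String].collect().contains("49"))
```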

protected def startStream(ds: Dataset[Int]) = {
ds.writeStream.foreach(new ForeachWriter[Int] {

override def open(partitionId: Long, version: Long): Boolean = {
Contributor

nit: make single line.

Thread.sleep(Random.nextInt(500))
}

override def close(errorOrNull: Throwable): Unit = {
Contributor

nit: make single line.
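A sketch of what the two nits suggest (illustrative, not the exact committed diff), assuming the surrounding `ds: Dataset[Int]` from the test: keep the trivial `open` and `close` overrides on a single line each.

```scala
import scala.util.Random
import org.apache.spark.sql.ForeachWriter

ds.writeStream.foreach(new ForeachWriter[Int] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: Int): Unit = {
    // Sleep a random interval so records can be aged out while the query runs.
    Thread.sleep(Random.nextInt(500))
  }

  override def close(errorOrNull: Throwable): Unit = {}
}).start()
```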

@SparkQA

SparkQA commented Aug 24, 2018

Test build #95224 has finished for PR 22207 at commit 3515275.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member Author

zsxwing commented Aug 24, 2018

Thanks! Merging to master and 2.3.

@asfgit asfgit closed this in 8bb9414 Aug 24, 2018
@zsxwing zsxwing deleted the SPARK-25214 branch August 24, 2018 20:57
@zsxwing
Member Author

zsxwing commented Aug 24, 2018

I just realized the Kafka source v2 is not in 2.3 :)

asfgit pushed a commit that referenced this pull request Aug 25, 2018
…turn duplicated records when `failOnDataLoss=false`

## What changes were proposed in this pull request?

This is a follow up PR for #22207 to fix a potential flaky test. `processAllAvailable` doesn't work for continuous processing so we should not use it for a continuous query.

## How was this patch tested?

Jenkins.

Closes #22230 from zsxwing/SPARK-25214-2.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
bogdanrdc pushed a commit to bogdanrdc/spark that referenced this pull request Aug 28, 2018
…cated records when `failOnDataLoss=false`

(Commit message is identical to the pull request description above.)

Closes apache#22207 from zsxwing/SPARK-25214.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
bogdanrdc pushed a commit to bogdanrdc/spark that referenced this pull request Aug 28, 2018
…turn duplicated records when `failOnDataLoss=false`

(Commit message is identical to the follow-up description above.)

Closes apache#22230 from zsxwing/SPARK-25214-2.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>