[SPARK-24647][SS] Report KafkaStreamWriter's written min and max offsets via CustomMetrics #22143
Conversation
@arunmahadevan @jose-torres @cloud-fan you may be interested in this one.
Jenkins, ok to test
Test build #94937 has finished for PR 22143 at commit
Report KafkaStreamWriter's written min and max offsets via CustomMetrics (force-pushed from 671bc81 to c812eff)
Test build #94941 has finished for PR 22143 at commit
arunmahadevan left a comment
Went through at a high level and left a few comments.
Seems we want to report something like below as sink metrics.
"minOffset" : {
"topic1" : {
"0" : 44, "1" : 95
}
},
"maxOffset" : {
"topic1" : {
"0" : 50, "1": 100
}
Before reviewing further, I would like to understand how this is useful, since we were already planning to report the number of output rows in the sink metrics (i.e. what use cases can be solved by directly exposing the Kafka offsets in addition to numOutputRows).
Also, should we report both max and min? (I assume the max of the current micro-batch would be the min for the next micro-batch.)
 * don't need to really send one.
 */
case object KafkaWriterCommitMessage extends WriterCommitMessage
case class KafkaWriterCommitMessage(minOffset: KafkaSourceOffset, maxOffset: KafkaSourceOffset)
It's kind of odd that the writer commit message includes a source offset. IMO, it is better to define a KafkaSinkOffset or, if it can be common, something like KafkaOffsets.
I would have to rename the class itself to avoid adding an additional duplicate class. I would love to do that; I am just not sure whether it would be accepted.
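For illustration only, a minimal sketch of the suggested neutral offset type; the name KafkaOffsets and its reuse in the commit message are hypothetical and not part of this PR, and the WriterCommitMessage import assumes the Spark 2.x DataSourceV2 writer package:
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage

// Hypothetical sink-agnostic offset holder, shared by source and sink,
// so the commit message no longer refers to a "source" offset type.
case class KafkaOffsets(partitionToOffsets: Map[TopicPartition, Long])

// The commit message would then carry neutral per-partition min/max offsets.
case class KafkaWriterCommitMessage(minOffset: KafkaOffsets, maxOffset: KafkaOffsets)
  extends WriterCommitMessage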
  KafkaWriterCustomMetrics(minMax._1, minMax._2)
}

private def collate(messages: Array[WriterCommitMessage]):
It would be good to leave a comment on what this does. It seems to be computing the min/max offset per partition? If so, choosing an apt name for the function would make it clearer.
Thanks, I will rename it to something with minMax.
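As a rough sketch of what such a minMax-style helper could compute, assuming the PR's KafkaWriterCommitMessage carries KafkaSourceOffset values whose partitionToOffsets maps partitions to offsets (the helper name and exact shape are hypothetical):
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage

// Hypothetical helper: fold the per-task commit messages into a single
// per-partition minimum and maximum written offset.
def minMaxOffsets(
    messages: Array[WriterCommitMessage]): (Map[TopicPartition, Long], Map[TopicPartition, Long]) = {
  val typed = messages.collect { case m: KafkaWriterCommitMessage => m }
  val mins = typed
    .flatMap(_.minOffset.partitionToOffsets)
    .groupBy { case (tp, _) => tp }
    .map { case (tp, offsets) => tp -> offsets.map(_._2).min }
  val maxs = typed
    .flatMap(_.maxOffset.partitionToOffsets)
    .groupBy { case (tp, _) => tp }
    .map { case (tp, offsets) => tp -> offsets.map(_._2).max }
  (mins, maxs)
}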
import scala.collection.JavaConverters._

protected val minOffsetAccumulator: collection.concurrent.Map[TopicPartition, Long] =
  new ConcurrentHashMap[TopicPartition, Long]().asScala
Why is this a concurrent map?
This map is accessed in callbacks concurrently with respect to different partitions. This can be seen from the call hierarchy and the docs of Kafka's send method.
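A minimal sketch of the pattern being described, assuming offsets are recorded from the producer's send callback; the object and field names here are illustrative, not the PR's exact code:
import java.util.concurrent.ConcurrentHashMap
import java.util.function.BiFunction

import org.apache.kafka.clients.producer.{Callback, RecordMetadata}
import org.apache.kafka.common.TopicPartition

object OffsetTrackingExample {
  // Kafka invokes send callbacks on the producer's I/O thread, potentially for
  // several partitions at once, so the per-partition minimum must be updated atomically.
  val minOffsets = new ConcurrentHashMap[TopicPartition, java.lang.Long]()

  val trackingCallback: Callback = new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
      if (exception == null) {
        val tp = new TopicPartition(metadata.topic(), metadata.partition())
        // Keep the smallest offset observed so far for this partition.
        minOffsets.merge(tp, Long.box(metadata.offset()),
          new BiFunction[java.lang.Long, java.lang.Long, java.lang.Long] {
            override def apply(a: java.lang.Long, b: java.lang.Long): java.lang.Long =
              Long.box(math.min(a, b))
          })
      }
    }
  }
}
A producer would pass the callback on each send, e.g. producer.send(record, OffsetTrackingExample.trackingCallback).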
@arunmahadevan min and max are used because there can be other writers to the same topic running in a different job. The messages sent would then become interleaved, and one would have to return a large number of intervals to be accurate. This approach gives sufficient information about where the data ended up being written, while also being resilient and simple. Would you recommend adding this as a Javadoc? To explain the motivation I updated the description of this PR using the description of the Jira. (To track data lineage we need to know, at least approximately, where data was read from and written to.)
@cloud-fan are you ok with merging the PR?
Currently this PR is waiting for SPARK-24748 to be re-merged, since it was reverted until the DataSource API v2 work is finished.
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Report KafkaStreamWriter's written min and max offsets via CustomMetrics. This is important for data lineage projects like Spline. Related issue: https://issues.apache.org/jira/browse/SPARK-24647
To be able to track data lineage for Structured Streaming (I intend to implement this in the open source project Spline), the monitoring needs to track not only where the data was read from but also where the results were written to. To my knowledge, this is best implemented by monitoring StreamingQueryProgress. However, the written data offsets are currently not available on the Sink or StreamWriter interfaces. Implementing this as proposed would also bring symmetry to the StreamingQueryProgress fields sources and sink.
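For context, a minimal sketch of how a lineage tool could consume such sink metrics once they are reported, assuming they surface through the sink entry of StreamingQueryProgress; the listener class name is made up for this example:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical listener: a lineage tool such as Spline would watch query progress
// and read the sink's JSON, where the proposed min/max offsets would appear.
class SinkOffsetLineageListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // event.progress.sources already exposes start/end offsets; the proposal
    // would add comparable information for the sink side.
    println(event.progress.sink.json)
  }
}

// Registration on an existing SparkSession (assumed to exist):
// spark.streams.addListener(new SinkOffsetLineageListener)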
How was this patch tested?
Unit tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.