[SPARK-24127][SS] Continuous text socket source#21199
arunmahadevan wants to merge 9 commits into apache:master from
Conversation
|
ping @jerryshao @tdas @jose-torres @HeartSaVioR for inputs. |
|
ok to test |
|
add to whitelist |
|
Test build #90349 has finished for PR 21199 at commit
|
|
I won't be able to look at this in detail until next week. In general, I think this is a great source to have available. I wonder if it'd be worthwhile to try and abstract the record forwarding RPCs from here and ContinuousMemoryStream together. |
context.reply(record.map(r => if (includeTimestamp) Row(r) else Row(r._1)))
Just one line is OK, I think.
If I understand correctly, this code path will never be exercised by your added test case.
|
I was wondering whether it is overkill to receive data on the driver side and publish it to the executors via RPC. This might give users the wrong impression that data should be received on the driver side and then published to the executors again. Just my two cents. |
|
I think that's unavoidable if we want to have a socket source. The microbatch socket source has the same thing going on. I'd expect most people looking into implementation details of data sources will understand that they ought to read from executors in general. |
|
Yes, this is similar to the micro-batch socket source, where the driver opens a single socket connection to read data from "nc". I would expect this pattern to be used only for debug and test sources, not so much for real ones. We can add some code comments to clarify this. |
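The driver-side pattern discussed above can be sketched roughly as follows. This is a hypothetical simplification, not the actual Spark implementation: a single reader thread appends incoming lines to per-partition buffers on the driver, and executors fetch records by (partitionId, offset) — modeled here as a plain method call standing in for the RPC endpoint.

```scala
import scala.collection.mutable.ListBuffer

// Illustrative driver-side buffer; names and structure are assumptions.
class DriverSideBuffer(numPartitions: Int) {
  private val buckets = Array.fill(numPartitions)(ListBuffer.empty[String])
  private var nextPartition = 0

  // Called by the socket reader thread for each line read from "nc";
  // records are distributed round-robin across partitions.
  def append(line: String): Unit = synchronized {
    buckets(nextPartition) += line
    nextPartition = (nextPartition + 1) % numPartitions
  }

  // Called by executors (via RPC in the real source): fetch the record
  // at the given per-partition offset, if it has arrived.
  def getRecord(partitionId: Int, offset: Int): Option[String] = synchronized {
    buckets(partitionId).lift(offset)
  }
}
```

Executors would poll `getRecord` for their partition and advance their local offset on each hit, which is the same shape as ContinuousMemoryStream's record-forwarding RPC.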
b3a42f0 to
b962c3d
Compare
|
Test build #90606 has finished for PR 21199 at commit
|
|
Test build #90603 has finished for PR 21199 at commit
|
|
A gentle ping for review @jose-torres , @jerryshao , @xuanyuanking |
nit: probably better to elaborate on what these are for; Seq[Seq[Any]] and Object aren't very informative types.
Added more comments to clarify.
Can't we use testStream for these tests?
We probably could, but addSocketData does not work for the continuous source, and I thought the reader offsets could be validated better this way (I followed the approach in RateStreamProviderSuite).
|
Test build #90744 has finished for PR 21199 at commit
|
|
Test build #90755 has finished for PR 21199 at commit
|
|
ping @tdas @jose-torres |
|
ok to test |
|
Test build #91615 has finished for PR 21199 at commit
|
|
@arunmahadevan |
68c5eed to
76512d8
Compare
|
Test build #93797 has finished for PR 21199 at commit
|
76512d8 to
a069d01
Compare
|
@HeartSaVioR , rebased with master. ping @jose-torres @tdas @zsxwing for review. |
|
Test build #93801 has finished for PR 21199 at commit
|
|
@arunmahadevan Thanks for rebasing. I'll take a look. |
HeartSaVioR
left a comment
The code change looks good overall.
I left comments mostly suggesting deduplication of code between micro-batch and continuous, plus some defensive programming. Some comments may just reflect individual preference, and I'm happy to follow Spark's conventions.
While these values are good to place in a companion object, it looks redundant to have them in both the micro-batch and continuous readers, so it might be better to have a common object for them.
We may need to find more spots to deduplicate between micro-batch and continuous for the socket source.
The companion object can be shared, but overall I guess we need to come up with better interfaces so that the micro-batch and continuous sources can share more code. I would investigate this outside the scope of this PR.
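The shared-companion-object suggestion above could look something like this minimal sketch (the object and key names are illustrative, not the actual Spark identifiers):

```scala
// Hypothetical common object holding the option keys used by both the
// micro-batch and continuous text socket readers, instead of duplicating
// the constants in each reader's companion object.
object TextSocketSourceOptions {
  val Host = "host"
  val Port = "port"
  val IncludeTimestamp = "includeTimestamp"
}
```

Both readers would then reference `TextSocketSourceOptions.Host` and friends, so a renamed or added option only needs one change.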
I'd rather make it safer via one of two approaches:
- assert partition offsets has all partition ids, 0 ~ numPartitions - 1
- add partition id in list element of TextSocketOffset as RateStreamContinuousReader and RateStreamOffset did
Personally I prefer option 2, but either is fine for me. Not sure which is more preferred for committers/Spark community.
There is already an assertion above, assert(offsets.length == numPartitions) (option 1). RateSource also uses similar validation. I am not sure adding the index adds any value here, since the socket source does not support recovery. Even in the Rate source, the stored partition values are 0 to numPartitions - 1, which can already be inferred from the index of the offset in the array.
Yeah, agreed. I'm OK with it if the same convention is used in other places.
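For illustration, the two validation options discussed above might look like the following sketch (class and method names are hypothetical stand-ins, not the PR's actual code):

```scala
// Option 1: the offset is a plain list of per-partition positions; validity
// is checked by asserting its length matches the partition count.
case class TextSocketOffset(offsets: List[Int])

def validateOption1(offset: TextSocketOffset, numPartitions: Int): Unit = {
  assert(offset.offsets.length == numPartitions,
    s"Expected $numPartitions partition offsets, got ${offset.offsets.length}")
}

// Option 2: each element carries its partition id explicitly, so missing or
// duplicated partitions can be detected, as RateStreamOffset does.
case class PartitionOffset(partitionId: Int, offset: Int)

def validateOption2(offsets: Seq[PartitionOffset], numPartitions: Int): Unit = {
  val ids = offsets.map(_.partitionId).sorted
  assert(ids == (0 until numPartitions).toList,
    s"Missing or duplicate partition ids: $ids")
}
```

Option 1 is lighter but relies on positional order; option 2 makes the partition mapping explicit at the cost of a slightly bigger offset representation.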
The micro-batch socket reader validates the type of end and calls sys.error with an informational message; we may want to give a similarly meaningful message here.
Btw, my two cents: a more specific exception is always better, so I'm +1 on throwing IllegalStateException rather than calling sys.error (which throws RuntimeException), as in the lines below.
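A minimal sketch of the suggested validation, with simplified stand-in types (the trait and class names here are illustrative, not the actual Spark API):

```scala
// Simplified stand-ins for the v2 Offset hierarchy.
trait Offset
case class TextSocketOffset(offsets: List[Int]) extends Offset
case class SomeOtherOffset(value: Long) extends Offset

// Validate the runtime type of the end offset and fail with a specific
// IllegalStateException carrying a meaningful message, rather than
// sys.error's generic RuntimeException.
def validateEndOffset(end: Offset): TextSocketOffset = end match {
  case o: TextSocketOffset => o
  case other => throw new IllegalStateException(
    s"Unexpected offset type ${other.getClass.getSimpleName}; expected TextSocketOffset")
}
```

Callers get either the correctly-typed offset back or an exception naming exactly what went wrong, which is easier to diagnose in logs than a bare RuntimeException.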
Ideally we could deduplicate the code between continuous and micro-batch by modifying the read thread to receive a handler for each new line, letting each reader handle the new line with its own locking. With this change we could use the same read thread for both the continuous and micro-batch readers.
We could probably refactor out common code, but the usages are slightly different, and I would like to do this outside the scope of this PR. I would like to identify and pull out some generic APIs that both micro-batch and continuous sources can implement, so that such duplication can be avoided in general. With the current approach there are always two separate implementations for each source type, and the chance of duplication is higher.
Yeah, if you're planning to investigate and touch the APIs, that sounds really good. Might be worth filing a new issue?
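The handler-based read-thread idea discussed above can be sketched like this, with a StringReader standing in for the socket connection (the function name and shape are assumptions, not the PR's actual code):

```scala
import java.io.{BufferedReader, StringReader}

// A single read loop shared by both readers: it knows nothing about
// buffering or offsets, and just hands each line to the supplied handler.
// Each reader (micro-batch or continuous) buffers under its own lock.
def readLines(reader: BufferedReader)(onLine: String => Unit): Unit = {
  var line = reader.readLine()
  while (line != null) {
    onLine(line)
    line = reader.readLine()
  }
}
```

Each source would then spawn a thread running `readLines(socketReader)(myHandler)`, keeping the socket plumbing in one place while the per-source semantics stay in the handler.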
nit: according to the style guide, this may need to be written as follows:
.map { rec =>
if (includeTimestamp) {
...
or even
ContinuousRecordPartitionOffset(partitionId, currentOffset))).map { rec =>
if (includeTimestamp) {
...
https://github.com/databricks/scala-style-guide#anonymous-methods
Looks like an unused import.
Maybe adding more line comments in the code block would make the test code easier to understand, e.g. noting that we intentionally commit in the middle of the range.
a069d01 to
f4a39d9
Compare
|
@HeartSaVioR , addressed your comments. Let me know if I missed something. I also rebased and had to change more code to use the new interfaces. I hope we can speed up the review cycles in general, rather than leaving PRs in hibernation for a while, after which the developer loses context and other things have changed in the meanwhile. |
|
The change looks broadly good (and important) to me. I'll defer to @HeartSaVioR wrt the in-depth review; let me know if there are any specific parts I should take a look at. |
|
Test build #93921 has finished for PR 21199 at commit
|
|
retest this please |
HeartSaVioR
left a comment
LGTM given we are planning to tackle deduplication of codebase between micro-batch and continuous later.
|
Test build #93939 has finished for PR 21199 at commit
|
|
retest this please |
|
Test build #94011 has finished for PR 21199 at commit
|
|
retest this please |
|
Test build #94148 has finished for PR 21199 at commit
|
|
retest this please |
|
Test build #94321 has finished for PR 21199 at commit
|
|
@HyukjinKwon this has been open for a while, would you mind taking this forward? |
|
retest this please |
|
@jose-torres and @HeartSaVioR, is it good to go? |
|
Test build #94411 has finished for PR 21199 at commit
|
|
Merged to master. |
What changes were proposed in this pull request?
Support for a text socket stream in Spark Structured Streaming "continuous" mode. This is roughly based on the idea of ContinuousMemoryStream, where the executors query the data from the driver over an RPC endpoint.
This makes it possible to create a Structured Streaming continuous pipeline to ingest data via "nc" and run the examples.
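A usage sketch of the resulting pipeline (assumes a running Spark session with this patch and `nc -lk 9999` feeding the socket; not runnable standalone, so shown for illustration only):

```scala
import org.apache.spark.sql.streaming.Trigger

// Read the text socket source; with this patch it also works under the
// continuous trigger, not just micro-batch.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Continuous processing with a 1-second checkpoint interval.
val query = lines.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```

Lines typed into the `nc` session should appear on the console with end-to-end latency governed by the continuous trigger rather than micro-batch boundaries.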
How was this patch tested?
Unit test and ran spark examples in structured streaming continuous mode.
Please review http://spark.apache.org/contributing.html before opening a pull request.