[SPARK-24127][SS] Continuous text socket source#21199
arunmahadevan wants to merge 9 commits into apache:master from
Conversation
|
ping @jerryshao @tdas @jose-torres @HeartSaVioR for inputs. |
|
ok to test |
|
add to whitelist |
|
Test build #90349 has finished for PR 21199 at commit
|
|
I won't be able to look at this in detail until next week. In general, I think this is a great source to have available. I wonder if it'd be worthwhile to try and abstract the record forwarding RPCs from here and ContinuousMemoryStream together. |
context.reply(record.map(r => if (includeTimestamp) Row(r) else Row(r._1)))
Just one line is OK, I think.
If I understand correctly, this code path will never be exercised by your added test case.
|
I was wondering whether it is overkill to receive data on the driver side and publish it to the executors via RPC. This might give users the wrong impression that data should be received on the driver side and then published to the executors again. Just my two cents. |
|
I think that's unavoidable if we want to have a socket source. The microbatch socket source has the same thing going on. I'd expect most people looking into implementation details of data sources will understand that they ought to read from executors in general. |
|
Yes, this is similar to the micro-batch socket source, where the driver opens a single socket connection to read data from "nc". I would expect this pattern to be used only for debug and test sources, not so much for real ones. We can add some code comments to clarify this. |
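The driver-side pattern discussed above can be sketched roughly as follows. This is a hypothetical simplification, not the actual Spark implementation: a single reader thread appends incoming lines to per-partition buffers on the driver, and executors fetch records by (partitionId, offset) — modeled here as a plain method call standing in for the RPC endpoint.

```scala
import scala.collection.mutable.ListBuffer

// Illustrative driver-side buffer; names and structure are assumptions.
class DriverSideBuffer(numPartitions: Int) {
  private val buckets = Array.fill(numPartitions)(ListBuffer.empty[String])
  private var nextPartition = 0

  // Called by the socket reader thread for each line read from "nc";
  // records are distributed round-robin across partitions.
  def append(line: String): Unit = synchronized {
    buckets(nextPartition) += line
    nextPartition = (nextPartition + 1) % numPartitions
  }

  // Called by executors (via RPC in the real source): fetch the record
  // at the given per-partition offset, if it has arrived.
  def getRecord(partitionId: Int, offset: Int): Option[String] = synchronized {
    buckets(partitionId).lift(offset)
  }
}
```

Executors would poll `getRecord` for their partition and advance their local offset on each hit, which is the same shape as ContinuousMemoryStream's record-forwarding RPC.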
b3a42f0 to
b962c3d
Compare
|
Test build #90606 has finished for PR 21199 at commit
|
|
Test build #90603 has finished for PR 21199 at commit
|
|
A gentle ping for review @jose-torres , @jerryshao , @xuanyuanking |
nit: probably better to elaborate on what these are for; Seq[Seq[Any]] and Object aren't very informative types.
Added more comments to clarify.
Can't we use testStream for these tests?
We probably could, but addSocketData does not work for the continuous source, and I thought the reader offsets could be validated better this way (I followed the approach in RateStreamProviderSuite).
|
Test build #90744 has finished for PR 21199 at commit
|
|
Test build #90755 has finished for PR 21199 at commit
|
|
ping @tdas @jose-torres |
|
ok to test |
|
Test build #91615 has finished for PR 21199 at commit
|
|
@arunmahadevan |
68c5eed to
76512d8
Compare
|
Test build #93797 has finished for PR 21199 at commit
|
76512d8 to
a069d01
Compare
|
@HeartSaVioR , rebased with master. ping @jose-torres @tdas @zsxwing for review. |
|
Test build #93801 has finished for PR 21199 at commit
|
|
@arunmahadevan Thanks for rebasing. I'll take a look. |
HeartSaVioR
left a comment
The code change looks good overall.
I left comments mostly suggesting deduplication of code between micro-batch and continuous, plus some defensive programming. Some comments may just reflect individual preference, and I'm happy to follow Spark's conventions.
While these values are good to place in a companion object, it looks redundant to have them in both the micro-batch and continuous readers, so it might be better to have a common object for them.
We may need to find more spots to deduplicate between micro-batch and continuous for the socket source.
The companion object can be shared, but overall I guess we need to come up with better interfaces so that the micro-batch and continuous sources can share more code. I would investigate this outside the scope of this PR.
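The shared-companion-object suggestion above could look something like this minimal sketch (the object and key names are illustrative, not the actual Spark identifiers):

```scala
// Hypothetical common object holding the option keys used by both the
// micro-batch and continuous text socket readers, instead of duplicating
// the constants in each reader's companion object.
object TextSocketSourceOptions {
  val Host = "host"
  val Port = "port"
  val IncludeTimestamp = "includeTimestamp"
}
```

Both readers would then reference `TextSocketSourceOptions.Host` and friends, so a renamed or added option only needs one change.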
I'd rather make it safer via one of two approaches:
- assert partition offsets has all partition ids, 0 ~ numPartitions - 1
- add partition id in list element of TextSocketOffset as RateStreamContinuousReader and RateStreamOffset did
Personally I prefer option 2, but either is fine for me. Not sure which is more preferred for committers/Spark community.
There is already an assertion above, assert(offsets.length == numPartitions) (option 1). RateSource also uses similar validation. I am not sure adding the index adds any value here, since the socket source does not support recovery. Even in the Rate source, the stored partition values are 0 to numPartitions - 1, which can already be inferred from the index of the offset in the array.
Yeah, agreed. I'm OK with it if the same convention is used in other places.
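For illustration, the two validation options discussed above might look like the following sketch (class and method names are hypothetical stand-ins, not the PR's actual code):

```scala
// Option 1: the offset is a plain list of per-partition positions; validity
// is checked by asserting its length matches the partition count.
case class TextSocketOffset(offsets: List[Int])

def validateOption1(offset: TextSocketOffset, numPartitions: Int): Unit = {
  assert(offset.offsets.length == numPartitions,
    s"Expected $numPartitions partition offsets, got ${offset.offsets.length}")
}

// Option 2: each element carries its partition id explicitly, so missing or
// duplicated partitions can be detected, as RateStreamOffset does.
case class PartitionOffset(partitionId: Int, offset: Int)

def validateOption2(offsets: Seq[PartitionOffset], numPartitions: Int): Unit = {
  val ids = offsets.map(_.partitionId).sorted
  assert(ids == (0 until numPartitions).toList,
    s"Missing or duplicate partition ids: $ids")
}
```

Option 1 is lighter but relies on positional order; option 2 makes the partition mapping explicit at the cost of a slightly bigger offset representation.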
The micro-batch socket reader validates the type of end and calls sys.error with an informational message; we may want to give a similarly meaningful message here.
Btw, my two cents: a more specific exception is always better, so I'm +1 on throwing IllegalStateException rather than calling sys.error (which throws RuntimeException), as in the lines below.
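A minimal sketch of the suggested validation, with simplified stand-in types (the trait and class names here are illustrative, not the actual Spark API):

```scala
// Simplified stand-ins for the v2 Offset hierarchy.
trait Offset
case class TextSocketOffset(offsets: List[Int]) extends Offset
case class SomeOtherOffset(value: Long) extends Offset

// Validate the runtime type of the end offset and fail with a specific
// IllegalStateException carrying a meaningful message, rather than
// sys.error's generic RuntimeException.
def validateEndOffset(end: Offset): TextSocketOffset = end match {
  case o: TextSocketOffset => o
  case other => throw new IllegalStateException(
    s"Unexpected offset type ${other.getClass.getSimpleName}; expected TextSocketOffset")
}
```

Callers get either the correctly-typed offset back or an exception naming exactly what went wrong, which is easier to diagnose in logs than a bare RuntimeException.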
Ideally we could deduplicate the code between continuous and micro-batch by modifying the read thread to receive a handler for each new line, letting each reader handle the new line with its own locking. With this change we could use the same read thread for both the continuous and micro-batch readers.
We could probably refactor out common code, but the usages are slightly different, and I would like to do this outside the scope of this PR. I would like to identify and pull out some generic APIs that both micro-batch and continuous sources can implement, so that such duplication can be avoided in general. With the current approach there are always two separate implementations for each source type, and the chance of duplication is higher.
Yeah, if you're planning to investigate and touch the APIs, that sounds really good. Might be worth filing a new issue?
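The handler-based read-thread idea discussed above can be sketched like this, with a StringReader standing in for the socket connection (the function name and shape are assumptions, not the PR's actual code):

```scala
import java.io.{BufferedReader, StringReader}

// A single read loop shared by both readers: it knows nothing about
// buffering or offsets, and just hands each line to the supplied handler.
// Each reader (micro-batch or continuous) buffers under its own lock.
def readLines(reader: BufferedReader)(onLine: String => Unit): Unit = {
  var line = reader.readLine()
  while (line != null) {
    onLine(line)
    line = reader.readLine()
  }
}
```

Each source would then spawn a thread running `readLines(socketReader)(myHandler)`, keeping the socket plumbing in one place while the per-source semantics stay in the handler.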
nit: according to the style guide, this may need to be written as follows:
.map { rec =>
if (includeTimestamp) {
...
or even
ContinuousRecordPartitionOffset(partitionId, currentOffset))).map { rec =>
if (includeTimestamp) {
...
https://github.com/databricks/scala-style-guide#anonymous-methods
Looks like an unused import.
Maybe adding more line comments in the code block would make the test code easier to understand, e.g. noting that we intentionally commit in the middle of the range.
a069d01 to
f4a39d9
Compare
|
@HeartSaVioR , addressed your comments. Let me know if I missed something. I also rebased and had to change more code to use the new interfaces. I hope we can speed up the review cycles in general, rather than leaving PRs in hibernation for a while, after which the developer loses context and other things have changed in the meanwhile. |
|
The change looks broadly good (and important) to me. I'll defer to @HeartSaVioR wrt the in-depth review; let me know if there are any specific parts I should take a look at. |
|
Test build #93921 has finished for PR 21199 at commit
|
|
retest this please |
HeartSaVioR
left a comment
LGTM given we are planning to tackle deduplication of codebase between micro-batch and continuous later.
|
Test build #93939 has finished for PR 21199 at commit
|
|
retest this please |
|
Test build #94011 has finished for PR 21199 at commit
|
|
retest this please |
|
Test build #94148 has finished for PR 21199 at commit
|
|
retest this please |
|
Test build #94321 has finished for PR 21199 at commit
|
|
@HyukjinKwon this has been open for a while, would you mind taking this forward? |
|
retest this please |
|
@jose-torres and @HeartSaVioR, is it good to go? |
|
Test build #94411 has finished for PR 21199 at commit
|
|
Merged to master. |
What changes were proposed in this pull request?
Support for a text socket stream in Spark Structured Streaming "continuous" mode. This is roughly based on the idea of ContinuousMemoryStream, where the executors query the data from the driver over an RPC endpoint.
This makes it possible to create a Structured Streaming continuous pipeline to ingest data via "nc" and run the examples.
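A usage sketch of the resulting pipeline (assumes a running Spark session with this patch and `nc -lk 9999` feeding the socket; not runnable standalone, so shown for illustration only):

```scala
import org.apache.spark.sql.streaming.Trigger

// Read the text socket source; with this patch it also works under the
// continuous trigger, not just micro-batch.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Continuous processing with a 1-second checkpoint interval.
val query = lines.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```

Lines typed into the `nc` session should appear on the console with end-to-end latency governed by the continuous trigger rather than micro-batch boundaries.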
How was this patch tested?
Unit test and ran spark examples in structured streaming continuous mode.
Please review http://spark.apache.org/contributing.html before opening a pull request.