[SPARK-26520][SQL] data source v2 API refactor (micro-batch read) #23430
Conversation
Test build #100655 has finished for PR 23430 at commit
It's cleaner to separate the Scan and the Builder.
I don't have a strong preference. Currently none of the streaming sources support operator pushdown, so it's easier to implement both of them together.
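For concreteness, "implement both of them together" can look roughly like the sketch below: one class serving as both the ScanBuilder and the Scan. Class and package names here follow the interfaces as they eventually landed in Spark 3.x (org.apache.spark.sql.connector.read), not necessarily this exact commit.

```java
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.ScanBuilder;
import org.apache.spark.sql.connector.read.streaming.MicroBatchStream;
import org.apache.spark.sql.types.StructType;

// Sketch: the ScanBuilder and the Scan are the same object, since there is no
// pushdown state to carry from the builder into the scan.
class SimpleStreamingScan implements ScanBuilder, Scan {
  private final StructType schema;

  SimpleStreamingScan(StructType schema) {
    this.schema = schema;
  }

  @Override
  public Scan build() {
    return this;  // nothing to fold in: the builder is the scan
  }

  @Override
  public StructType readSchema() {
    return schema;
  }

  @Override
  public MicroBatchStream toMicroBatchStream(String checkpointLocation) {
    // Offset tracking and partition planning omitted in this sketch.
    throw new UnsupportedOperationException("not implemented in this sketch");
  }
}
```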
IMO, a name like Batch (defined above) does not convey much and is inconsistent with MicroBatchStream. If these represent the physical scans (as the comments indicate), maybe rename them to BatchScan/MicroBatchScan and so on.
E.g.
public interface Scan {
    // ...
    BatchScan toBatch();
    MicroBatchScan toMicroBatch();
    ContinuousScan toContinuous();
}
And maybe then these scans could extend a marker interface like PhysicalScan to differentiate them from the logical Scan.
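For concreteness, the marker-interface idea could look like the sketch below; all of these names are hypothetical and are not part of the PR.

```java
// Hypothetical naming sketch: the physical scans share a marker parent,
// distinct from the logical Scan.
interface PhysicalScan { }

interface BatchScan extends PhysicalScan { /* batch partition planning */ }
interface MicroBatchScan extends PhysicalScan { /* offsets + per-micro-batch partition planning */ }
interface ContinuousScan extends PhysicalScan { /* continuous partition readers */ }
```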
Another idea is to merge Scan and Batch/Stream; see alternative 1 in the doc.
In "alternative1" there is no equivalent Logical Scan? I was thinking we need the Scan (the logical scan) separate from physical scans.
Also if they don't inherit a common parent can it be passed to the DatasourceV2ScanExec ?
Anyways better to relook and rename as appropriate to keep the different ones (batch/micro-batch/continuous) have common pre/suffixes and denote what they mean.
Why do we need an additional level? Can't this be part of the micro-batch physical scan (MicroBatchStream)?
This is for ContinuousStream, which will be added later. Please refer to the doc for more details.
Makes sense. A Stream (a stream of events) and a Source typically imply different things. In this case the SparkDataStream looks to be more of a source-specific thing than a stream.
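For context, SparkDataStream is the execution-mode-agnostic parent that MicroBatchStream (and later ContinuousStream) extend; it covers offset management and lifecycle. The sketch below uses the method set and package from the API as it eventually stabilized, which may differ slightly from this commit.

```java
import org.apache.spark.sql.connector.read.streaming.Offset;

// Approximate shape of SparkDataStream: source-level offset management and
// lifecycle, shared by both micro-batch and continuous execution.
public interface SparkDataStream {
  Offset initialOffset();                 // starting offset when there is no checkpoint
  Offset deserializeOffset(String json);  // restore an offset from its checkpointed JSON form
  void commit(Offset end);                // data up to `end` is processed and can be discarded
  void stop();                            // release any resources held by the source
}
```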
Instead of just being marker interfaces, could these be mixin interfaces for the Scan (that also define the respective methods)? Like:
public interface SupportsBatchRead extends Scan {
    Batch toBatch();
}
and so on. And if a Table supports read, one could query its Scan object to figure out the type if required.
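Under that proposal, the engine would inspect the Scan's runtime type to decide how to execute it. A hypothetical sketch (Scan, Batch and SupportsBatchRead here refer to the interfaces sketched in this comment, not the API the PR adopts):

```java
final class ScanModes {
  // Hypothetical dispatch on the scan's mixin type rather than on table-level marker interfaces.
  static Batch requireBatch(Scan scan) {
    if (scan instanceof SupportsBatchRead) {
      return ((SupportsBatchRead) scan).toBatch();
    }
    throw new UnsupportedOperationException("scan does not support batch reads");
  }
}
```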
We need to know the capability at the table level. It's too late to do it at the scan level, as creating a scan may be expensive.
The only thing is that it doesn't enforce anything. A method like Table.supportedTypes() might also work.
@arunmahadevan, that's similar to the capabilities that we plan to add. Spark will query specific capabilities for a table to make determinations like this to cut down on the number of empty interfaces.
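For reference, a capability query could look like the sketch below; the names are illustrative (Spark later shipped a similar idea as a TableCapability enum queried from the table), not something defined in this PR.

```java
import java.util.Set;
import org.apache.spark.sql.types.StructType;

// Sketch: the table advertises what it supports, and the engine checks the set
// before building a scan, avoiding empty marker interfaces.
enum ReadCapability {
  BATCH_READ,
  MICRO_BATCH_READ,
  CONTINUOUS_READ
}

interface Table {
  String name();
  StructType schema();
  Set<ReadCapability> capabilities();
}
```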
This is moved to a new file: https://github.com/apache/spark/pull/23430/files#diff-c223080834806efd92efc860656b60cdR34
Test build #100699 has finished for PR 23430 at commit
Test build #100724 has finished for PR 23430 at commit
retest this please
Test build #100729 has finished for PR 23430 at commit
Test build #100867 has finished for PR 23430 at commit
retest this please
Test build #100879 has finished for PR 23430 at commit
retest this please
Test build #100890 has finished for PR 23430 at commit
Force-pushed from 0866233 to 5a4047e
Test build #101389 has finished for PR 23430 at commit
jose-torres left a comment
LGTM. Note that I haven't really thought about the naming, just tried to confirm that it follows the community consensus in the doc.
StreamingExecutionRelation(readSupport, output)(sparkSession)
logInfo(s"Reading table [$table] from DataSourceV2 named '$dsName' [$ds]")
val dsOptions = new DataSourceOptions(options.asJava)
// TODO: operator pushdown.
Is this an urgent TODO, or does the existing v2 interface already not handle pushdown? (I probably should know this, but it's been a while since the original implementation.)
It's not urgent, as no built-in streaming source supports pushdown yet. The DS v2 API can handle pushdown, and our batch sources do support it.
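For context, pushdown in the DS v2 API is opt-in via mixins on the scan-building side. A minimal sketch of a builder that accepts filter pushdown is below, using the interface names from the API as it eventually stabilized (org.apache.spark.sql.connector.read in Spark 3.x); the package differed at the time of this PR.

```java
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.SupportsPushDownFilters;
import org.apache.spark.sql.sources.Filter;

// Sketch: a ScanBuilder that accepts filter pushdown. pushFilters() returns the
// filters Spark should still evaluate after the scan; pushedFilters() reports
// what the source will try to apply itself.
class PushdownScanBuilder implements SupportsPushDownFilters {
  private Filter[] pushed = new Filter[0];

  @Override
  public Filter[] pushFilters(Filter[] filters) {
    this.pushed = filters;  // remember what we will try to apply at the source
    return filters;         // let Spark re-evaluate them as well, which is always safe
  }

  @Override
  public Filter[] pushedFilters() {
    return pushed;
  }

  @Override
  public Scan build() {
    // Scan construction omitted in this sketch.
    throw new UnsupportedOperationException("not implemented in this sketch");
  }
}
```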
Please open JIRAs for all these TODO items. Thanks!
retest this please
Test build #101489 has finished for PR 23430 at commit
LGTM Thanks! Merged to master.
Follow-up commit (continuous read):
## What changes were proposed in this pull request?
Following apache#23430, this PR does the API refactor for continuous read, w.r.t. the [doc](https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit?usp=sharing)
The major changes:
1. rename `XXXContinuousReadSupport` to `XXXContinuousStream`
2. at the beginning of continuous streaming execution, convert `StreamingRelationV2` to `StreamingDataSourceV2Relation` directly, instead of `StreamingExecutionRelation`.
3. remove all the hacks as we have finished all the read side API refactor
## How was this patch tested?
existing tests
Closes apache#23619 from cloud-fan/continuous.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
What changes were proposed in this pull request?
Following #23086, this PR does the API refactor for micro-batch read, w.r.t. the [doc](https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit?usp=sharing)
The major changes:
1. rename `XXXMicroBatchReadSupport` to `XXXMicroBatchStream`
2. implement `TableProvider`, `Table`, `ScanBuilder` and `Scan` for streaming sources (see the layering sketch after this list)
3. at the beginning of micro-batch streaming execution, convert `StreamingRelationV2` to `StreamingDataSourceV2Relation` directly, instead of `StreamingExecutionRelation`.

Follow-up:
support operator pushdown for stream sources
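To make the layering concrete, a micro-batch read roughly flows through the objects below. This is only a sketch: method and package names follow the API as it eventually stabilized in Spark 3.x, which differs in package layout (and some details) from this commit.

```java
import org.apache.spark.sql.connector.catalog.SupportsRead;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.ScanBuilder;
import org.apache.spark.sql.connector.read.streaming.MicroBatchStream;
import org.apache.spark.sql.connector.read.streaming.Offset;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Layering sketch: table -> scan builder -> logical scan -> micro-batch stream.
final class MicroBatchReadSketch {
  static void planOneMicroBatch(Table table, CaseInsensitiveStringMap options, String checkpointLocation) {
    ScanBuilder builder = ((SupportsRead) table).newScanBuilder(options);   // pushdown would be applied here
    Scan scan = builder.build();                                            // logical scan
    MicroBatchStream stream = scan.toMicroBatchStream(checkpointLocation);  // physical streaming scan

    // The micro-batch engine then drives the stream:
    Offset start = stream.initialOffset();
    Offset end = stream.latestOffset();
    InputPartition[] partitions = stream.planInputPartitions(start, end);
    PartitionReaderFactory readerFactory = stream.createReaderFactory();
    // partitions + readerFactory are handed to the physical scan node for execution
  }
}
```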
How was this patch tested?
existing tests