Conversation

cloud-fan (Contributor) commented Jan 2, 2019

What changes were proposed in this pull request?

Following #23086, this PR does the API refactor for micro-batch read, w.r.t. the doc

The major changes:

  1. rename XXXMicroBatchReadSupport to XXXMicroBatchStream
  2. implement TableProvider, Table, ScanBuilder and Scan for streaming sources
  3. at the beginning of micro-batch streaming execution, convert StreamingRelationV2 to StreamingDataSourceV2Relation directly, instead of StreamingExecutionRelation.

followup:
support operator pushdown for stream sources
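
For orientation, here is a minimal sketch of how the refactored read path fits together. The interface names follow this PR's discussion; the exact signatures are illustrative assumptions, not excerpts from the Spark source:

public interface Scan {
  // The logical scan, created by a ScanBuilder obtained from a Table.
  StructType readSchema();

  // Physical scans hang off the logical scan. For micro-batch
  // streaming, MicroBatchStream replaces the old XXXMicroBatchReadSupport.
  Batch toBatch();
  MicroBatchStream toMicroBatchStream(String checkpointLocation);
}

public interface MicroBatchStream extends SparkDataStream {
  // Offsets delimit each micro-batch; partitions are planned per batch.
  Offset latestOffset();
  InputPartition[] planInputPartitions(Offset start, Offset end);
  PartitionReaderFactory createReaderFactory();
}

With change 3, the planner can then put a StreamingDataSourceV2Relation carrying such a Scan directly into the micro-batch plan, instead of going through StreamingExecutionRelation.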

How was this patch tested?

existing tests

SparkQA commented Jan 2, 2019

Test build #100655 has finished for PR 23430 at commit 2fbecf5.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor commented:

It's cleaner to separate the Scan and the Builder.

cloud-fan (Author) commented:

I don't have a strong preference. Currently none of the streaming sources support operator pushdown, so it's easier to implement both of them together.
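
(For illustration, a minimal sketch of the "both together" pattern with a hypothetical class name: a source without pushdown can let one object serve as both the builder and the scan it builds.)

class SimpleStreamingScan implements ScanBuilder, Scan {
  private final StructType schema;

  SimpleStreamingScan(StructType schema) {
    this.schema = schema;
  }

  @Override
  public Scan build() {
    // No pushdown state to fold in, so the builder returns itself.
    return this;
  }

  @Override
  public StructType readSchema() {
    return schema;
  }
}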

Contributor commented:

IMO, a name like Batch (defined above) does not convey much and is inconsistent with MicroBatchStream. If these represent the physical scans (as the comments indicate), maybe rename them to BatchScan/MicroBatchScan and so on.

E.g.

public interface Scan {
  // ...
  BatchScan toBatch();
  MicroBatchScan toMicroBatch();
  ContinuousScan toContinuous();
}

And maybe these scans could then extend a marker interface like PhysicalScan to differentiate them from the logical Scan.

cloud-fan (Author) commented:

Another idea is to merge Scan and Batch/Stream; see alternative 1 in the doc.

Contributor commented:

In "alternative1" there is no equivalent Logical Scan? I was thinking we need the Scan (the logical scan) separate from physical scans.

Also if they don't inherit a common parent can it be passed to the DatasourceV2ScanExec ?

Anyways better to relook and rename as appropriate to keep the different ones (batch/micro-batch/continuous) have common pre/suffixes and denote what they mean.

Contributor commented:

Why do we need an additional level? Can't this be part of the micro-batch physical scan (MicroBatchStream)?

cloud-fan (Author) commented:

This is for ContinuousStream which will be added later. Please refer to the doc for more details.

arunmahadevan (Contributor) commented Jan 3, 2019:

Makes sense. A Stream (a stream of events) and a Source typically imply different things. In this case the SparkDataStream looks to be more of a source-specific thing than a stream.

Contributor commented:

Instead of just being a marker interface, could these be a mixin interface for the Scan (one that also defines the respective methods)? Like:

public interface SupportsBatchRead extends Scan {
  Batch toBatch();
}

and so on. Then, if a Table supports read, one could query its Scan object to figure out the type if required.

cloud-fan (Author) commented:

We need to know the capability at the table level. It's too late to do it at the scan level, as creating a scan may be expensive.
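
(A sketch of the resulting flow; table, options, and checkpointLocation are illustrative stand-ins. The marker interface is checked on the Table before any scan is built:)

// Cheap capability check at the table level.
if (table instanceof SupportsMicroBatchRead) {
  // Only now pay the potentially expensive cost of building a scan.
  ScanBuilder builder = ((SupportsMicroBatchRead) table).newScanBuilder(options);
  MicroBatchStream stream = builder.build().toMicroBatchStream(checkpointLocation);
}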

Contributor commented:

The only thing is that it doesn't enforce anything. A method like Table.supportedTypes() might also work.

Contributor commented:

@arunmahadevan, that's similar to the capabilities that we plan to add. Spark will query specific capabilities for a table to make determinations like this to cut down on the number of empty interfaces.
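
(For context: the capability query sketched below is roughly the shape this idea later took in Spark as TableCapability; it is shown to illustrate the direction, not as part of this PR.)

if (table.capabilities().contains(TableCapability.MICRO_BATCH_READ)) {
  // The table declares micro-batch read support without an empty
  // marker interface per capability.
}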

SparkQA commented Jan 3, 2019

Test build #100699 has finished for PR 23430 at commit af5fcc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 4, 2019

Test build #100724 has finished for PR 23430 at commit a869b20.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Author) commented:

retest this please

SparkQA commented Jan 4, 2019

Test build #100729 has finished for PR 23430 at commit a869b20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 7, 2019

Test build #100867 has finished for PR 23430 at commit 0866233.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Author) commented:

retest this please

SparkQA commented Jan 7, 2019

Test build #100879 has finished for PR 23430 at commit 0866233.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Author) commented:

retest this please

SparkQA commented Jan 7, 2019

Test build #100890 has finished for PR 23430 at commit 0866233.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 18, 2019

Test build #101389 has finished for PR 23430 at commit 5a4047e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jose-torres (Contributor) left a comment:

LGTM. Note that I haven't really thought about the naming, just tried to confirm that it follows the community consensus in the doc.

(Diff context for the review thread below:)

StreamingExecutionRelation(readSupport, output)(sparkSession)
logInfo(s"Reading table [$table] from DataSourceV2 named '$dsName' [$ds]")
val dsOptions = new DataSourceOptions(options.asJava)
// TODO: operator pushdown.
Contributor commented:

Is this an urgent TODO, or does the existing v2 interface already not handle pushdown? (I probably should know this, but it's been a while since the original implementation.)

cloud-fan (Author) commented:

It's not urgent, as no built-in streaming source supports pushdown yet. The DS v2 API can handle pushdown, and our batch sources do support it.
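
(For context, a sketch of how pushdown plugs into the v2 API as a mixin on the ScanBuilder, modeled on the batch-side filter pushdown; treat the exact signatures as approximate.)

public interface SupportsPushDownFilters extends ScanBuilder {
  // Pushes filters to the source; returns the filters that Spark
  // must still evaluate after the scan.
  Filter[] pushFilters(Filter[] filters);

  // The filters the source accepted, reported for query planning
  // and explain output.
  Filter[] pushedFilters();
}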

Member commented:

Please open JIRAs for all these todo items. Thanks!

gatorsmile (Member) commented:

retest this please

SparkQA commented Jan 21, 2019

Test build #101489 has finished for PR 23430 at commit 5a4047e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) commented:

LGTM

Thanks! Merged to master.

asfgit closed this in 098a2c4 on Jan 21, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Following apache#23086, this PR does the API refactor for micro-batch read, w.r.t. the [doc](https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit?usp=sharing)

The major changes:
1. rename `XXXMicroBatchReadSupport` to `XXXMicroBatchStream`
2. implement `TableProvider`, `Table`, `ScanBuilder` and `Scan` for streaming sources
3. at the beginning of micro-batch streaming execution, convert `StreamingRelationV2` to `StreamingDataSourceV2Relation` directly, instead of `StreamingExecutionRelation`.

followup:
support operator pushdown for stream sources

## How was this patch tested?

existing tests

Closes apache#23430 from cloud-fan/micro-batch.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Following apache#23430, this PR does the API refactor for continuous read, w.r.t. the [doc](https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit?usp=sharing)

The major changes:
1. rename `XXXContinuousReadSupport` to `XXXContinuousStream`
2. at the beginning of continuous streaming execution, convert `StreamingRelationV2` to `StreamingDataSourceV2Relation` directly, instead of `StreamingExecutionRelation`.
3. remove all the hacks as we have finished all the read side API refactor

## How was this patch tested?

existing tests

Closes apache#23619 from cloud-fan/continuous.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: gatorsmile <[email protected]>