
Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Aug 1, 2018

What changes were proposed in this pull request?

Regarding user-specified schemas, data sources may have 3 different behaviors:

  1. must have a user-specified schema
  2. can't have a user-specified schema
  3. can accept a user-specified schema if one is given, or infer the schema otherwise

I added `ReadSupportWithSchema` to support these behaviors, following data source v1. But it turns out we don't need this extra interface: we can just add a `createReader(schema, options)` method to `ReadSupport` and make it call `createReader(options)` by default.
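As a rough illustration, a minimal sketch of the merged interface could look like this (not the merged code itself; the interface and type names follow the DataSourceV2 API of this era, and the real default may validate or reject the user schema rather than silently ignore it):

```java
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.types.StructType;

public interface ReadSupport {

  // Existing method: create a reader, inferring the schema from the source.
  DataSourceReader createReader(DataSourceOptions options);

  // New overload: accept a user-specified schema. The default simply
  // delegates to the schema-inferring variant, so sources that don't
  // handle user schemas need no separate ReadSupportWithSchema interface.
  default DataSourceReader createReader(StructType schema, DataSourceOptions options) {
    return createReader(options);
  }
}
```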

TODO: also fix the streaming API in followup PRs.

How was this patch tested?

Existing tests.

@holdensmagicalunicorn

@cloud-fan, thanks! I am a bot who has found some folks who might be able to help with the review: @gatorsmile, @zsxwing and @tdas

@cloud-fan
Contributor Author

cc @rxin @rdblue @jose-torres

@jose-torres
Contributor

Wouldn't the API redesign we're discussing make this obsolete?

@cloud-fan
Contributor Author

In the new proposal, we just rename ReadSupport to BatchReadSupportProvider, so this change is effectively part of the larger proposal.

@rdblue
Contributor

rdblue commented Aug 1, 2018

Isn't this unnecessary after the API redesign?

For the redesign, DataSourceV2 or a ReadSupportProvider will supply a create method (or anonymousTable) to return a Table that implements ReadSupport. ReadSupport should not accept user schemas, because the schema should be accessible from the Table itself. That way, we can use the same table-based relation (see https://github.com/apache/spark/pull/21877/files#diff-35ba4ffb5ccb9b18b43226f1d5effa23R82).
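Roughly, the table-based shape under discussion looks like this (all names here are hypothetical, taken from the proposal rather than any final API):

```java
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.types.StructType;

// Hypothetical stand-in for the redesigned read API
// (roughly the role DataSourceReader plays today).
interface ReadSupport { /* scan planning methods elided */ }

// Hypothetical: a Table carries its own schema, so a user-specified
// schema never needs to be passed to the reader separately.
interface Table extends ReadSupport {
  StructType schema();
}

// Hypothetical: the source (DataSourceV2 or a provider) hands out tables.
interface ReadSupportProvider {
  Table anonymousTable(DataSourceOptions options);
}
```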

@rdblue
Contributor

rdblue commented Aug 1, 2018

@cloud-fan, from your comment around the same time as mine, it sounds like the confusion may just be in how you're updating the current API to the proposed one. Can you post a migration plan? It sounds like something like this:

ReadSupport and ReadSupportWithSchema -> BatchReadSupportProvider
DataSourceReader -> ReadSupport

Is that right? The re-use of ReadSupport would explain the confusion on my end.

@cloud-fan
Contributor Author

a ReadSupportProvider will supply a create method (or anonymousTable) to return a Table that implements ReadSupport...

I'd prefer the current proposal in https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?ts=5b613c42: ReadSupportProvider#create returns ReadSupport. Two reasons (see the sketch after this list):

  1. Table supports both read and write, but we may want to allow read-only data sources.
  2. The Table interface is still being developed; we can switch to it later if it turns out to be more feasible.
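By contrast, this alternative keeps the provider-to-ReadSupport path direct, with no intermediate Table (again a hypothetical sketch using the proposal's names):

```java
import org.apache.spark.sql.sources.v2.DataSourceOptions;

// Hypothetical stand-in for the redesigned read API.
interface ReadSupport { /* scan planning methods elided */ }

// Hypothetical: the provider creates a ReadSupport directly, which
// keeps read-only data sources possible without a read/write Table.
interface ReadSupportProvider {
  ReadSupport create(DataSourceOptions options);
}
```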

@cloud-fan
Contributor Author

@rdblue the plan is: I will have a big PR that implements the redesign. However, if something makes sense even without the redesign, it should go into a separate PR. I think merging ReadSupport and ReadSupportWithSchema is one such change.

@cloud-fan
Contributor Author

ReadSupport and ReadSupportWithSchema -> BatchReadSupportProvider
DataSourceReader -> ReadSupport

Yeah, this is what I'm doing in my local branch for the redesign. I'll push it as soon as it's finished.

@rdblue
Contributor

rdblue commented Aug 1, 2018

Yeah, I'm fine with this, then. It may be better to combine this with the other change, or to add the context to the description.

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93888 has finished for PR 21946 at commit 19808d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@rdblue This change is pretty isolated, and it LGTM as well.

Since you are fine with the change, I am assuming you are not blocking it. I will merge this soon.

@rdblue
Contributor

rdblue commented Aug 1, 2018

+1

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93891 has finished for PR 21946 at commit 6cac2b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93896 has finished for PR 21946 at commit 1f0c9a7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
  • sealed class Hasher[@specialized(Long, Int, Double, Float) T] extends Serializable
  • class DoubleHasher extends Hasher[Double]
  • class FloatHasher extends Hasher[Float]
  • case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike
  • case class ArrayExcept(left: Expression, right: Expression) extends ArraySetLike

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93897 has finished for PR 21946 at commit 417930a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@gatorsmile gatorsmile left a comment

Thanks! Merged to master.

@asfgit asfgit closed this in ce084d3 Aug 1, 2018
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019
rdblue pushed a commit to rdblue/spark that referenced this pull request Apr 3, 2019
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019