
Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Aug 1, 2018

What changes were proposed in this pull request?

Regarding user-specified schemas, data sources may have 3 different behaviors:

  1. must have a user-specified schema
  2. can't have a user-specified schema
  3. can accept a user-specified schema if one is given, or infer the schema otherwise

I added `ReadSupportWithSchema` to support these behaviors, following data source v1. But it turns out we don't need this extra interface: we can just add a `createReader(schema, options)` method to `ReadSupport` and make it call `createReader(options)` by default.
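As a rough illustration, a minimal sketch of the merged interface could look like this (not the merged code itself; the interface and type names follow the DataSourceV2 API of this era, and the real default may validate or reject the user schema rather than silently ignore it):

```java
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.types.StructType;

public interface ReadSupport {

  // Existing method: create a reader, inferring the schema from the source.
  DataSourceReader createReader(DataSourceOptions options);

  // New overload: accept a user-specified schema. The default simply
  // delegates to the schema-inferring variant, so sources that don't
  // handle user schemas need no separate ReadSupportWithSchema interface.
  default DataSourceReader createReader(StructType schema, DataSourceOptions options) {
    return createReader(options);
  }
}
```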

TODO: also fix the streaming API in followup PRs.

How was this patch tested?

Existing tests.

@holdensmagicalunicorn

@cloud-fan, thanks! I am a bot who has found some folks who might be able to help with the review: @gatorsmile, @zsxwing and @tdas

@cloud-fan
Contributor Author

cc @rxin @rdblue @jose-torres

@jose-torres
Contributor

Wouldn't the API redesign we're discussing make this obsolete?

@cloud-fan
Contributor Author

In the new proposal, we just rename ReadSupport to BatchReadSupportProvider, so this change is effectively part of the larger proposal.

@rdblue
Contributor

rdblue commented Aug 1, 2018

Isn't this unnecessary after the API redesign?

For the redesign, DataSourceV2 or a ReadSupportProvider will supply a create method (or anonymousTable) to return a Table that implements ReadSupport. ReadSupport should not accept user schemas, because the schema should be accessible from the Table itself. That way, we can use the same table-based relation (see https://github.com/apache/spark/pull/21877/files#diff-35ba4ffb5ccb9b18b43226f1d5effa23R82).
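Roughly, the table-based shape under discussion looks like this (all names here are hypothetical, taken from the proposal rather than any final API):

```java
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.types.StructType;

// Hypothetical stand-in for the redesigned read API
// (roughly the role DataSourceReader plays today).
interface ReadSupport { /* scan planning methods elided */ }

// Hypothetical: a Table carries its own schema, so a user-specified
// schema never needs to be passed to the reader separately.
interface Table extends ReadSupport {
  StructType schema();
}

// Hypothetical: the source (DataSourceV2 or a provider) hands out tables.
interface ReadSupportProvider {
  Table anonymousTable(DataSourceOptions options);
}
```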

@rdblue
Contributor

rdblue commented Aug 1, 2018

@cloud-fan, from your comment around the same time as mine, it sounds like the confusion may just be in how you're updating the current API to the proposed one. Can you post a migration plan? It sounds like something like this:

ReadSupport and ReadSupportWithSchema -> BatchReadSupportProvider
DataSourceReader -> ReadSupport

Is that right? The re-use of ReadSupport would explain the confusion on my end.

@cloud-fan
Contributor Author

a ReadSupportProvider will supply a create method (or anonymousTable) to return a Table that implements ReadSupport...

I'd prefer the current proposal in https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?ts=5b613c42: ReadSupportProvider#create returns ReadSupport. Two reasons (see the sketch after this list):

  1. Table supports both read and write, but we may want to allow read-only data sources.
  2. The Table interface is still being developed; we can switch to it later if it turns out to be more feasible.
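By contrast, this alternative keeps the provider-to-ReadSupport path direct, with no intermediate Table (again a hypothetical sketch using the proposal's names):

```java
import org.apache.spark.sql.sources.v2.DataSourceOptions;

// Hypothetical stand-in for the redesigned read API.
interface ReadSupport { /* scan planning methods elided */ }

// Hypothetical: the provider creates a ReadSupport directly, which
// keeps read-only data sources possible without a read/write Table.
interface ReadSupportProvider {
  ReadSupport create(DataSourceOptions options);
}
```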

@cloud-fan
Contributor Author

@rdblue the plan is: I will have a big PR that implements the redesign. However, if something makes sense even without the redesign, it should go into a separate PR. I think merging ReadSupport and ReadSupportWithSchema is one such change.

@cloud-fan
Contributor Author

ReadSupport and ReadSupportWithSchema -> BatchReadSupportProvider
DataSourceReader -> ReadSupport

Yeah, this is what I'm doing in my local branch for the redesign. I'll push it as soon as it's finished.

@rdblue
Contributor

rdblue commented Aug 1, 2018

Yeah, I'm fine with this, then. It may be better to combine this with the other change, or to add the context to the description.

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93888 has finished for PR 21946 at commit 19808d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@rdblue This change is pretty isolated, and it LGTM as well.

Since you are fine with the change, I am assuming you are not blocking it. I will merge this soon.

@rdblue
Contributor

rdblue commented Aug 1, 2018

+1

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93891 has finished for PR 21946 at commit 6cac2b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93896 has finished for PR 21946 at commit 1f0c9a7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
  • sealed class Hasher[@specialized(Long, Int, Double, Float) T] extends Serializable
  • class DoubleHasher extends Hasher[Double]
  • class FloatHasher extends Hasher[Float]
  • case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike
  • case class ArrayExcept(left: Expression, right: Expression) extends ArraySetLike

@SparkQA

SparkQA commented Aug 1, 2018

Test build #93897 has finished for PR 21946 at commit 417930a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@gatorsmile gatorsmile left a comment

Thanks! Merged to master.

@asfgit asfgit closed this in ce084d3 Aug 1, 2018
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019
rdblue pushed a commit to rdblue/spark that referenced this pull request Apr 3, 2019
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019