[SPARK-24882][SQL] improve data source v2 API #22009
Changes from 1 commit
BatchWriteSupportProvider.java

@@ -28,15 +28,16 @@
  * A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to
  * provide data writing ability for batch processing.
  *
- * This interface is used when end users want to use a data source implementation directly, e.g.
+ * This interface is used to return {@link BatchWriteSupport} instances when end users run
  * {@code Dataset.write.format(...).option(...).save()}.
  */
 @InterfaceStability.Evolving
 public interface BatchWriteSupportProvider extends DataSourceV2 {
 
   /**
-   * Creates an optional {@link BatchWriteSupport} to save the data to this data source. Data
-   * sources can return None if there is no writing needed to be done according to the save mode.
+   * Returns an optional {@link BatchWriteSupport} instance to save the data to this data source.
+   * Data sources can return None if there is no writing needed to be done according to the save
+   * mode.
    *
    * @param queryId A unique string for the writing query. It's possible that there are many
    *                writing queries running at the same time, and the returned
@@ -48,7 +49,7 @@ public interface BatchWriteSupportProvider extends DataSourceV2 {
    * case-insensitive string-to-string map.
    * @return a write support to write data to this data source.
    */
-  Optional<BatchWriteSupport> createBatchWriteSupport(
+  Optional<BatchWriteSupport> getBatchWriteSupport(
       String queryId,
       StructType schema,
       SaveMode mode,
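For illustration, a source implementing this provider mix-in might look roughly like the sketch below. This is not code from the PR: `ExampleSource` and `targetExists` are made-up names, the final parameter (truncated in the diff) is assumed to be the case-insensitive options map described in the @param docs (`DataSourceOptions` in this version of the API), and the import locations follow this PR's layout and may differ in other versions.

```java
import java.util.Optional;

import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.sources.v2.BatchWriteSupportProvider;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport;
import org.apache.spark.sql.types.StructType;

// Hypothetical source; only the BatchWriteSupportProvider part is sketched.
public class ExampleSource implements BatchWriteSupportProvider {

  @Override
  public Optional<BatchWriteSupport> getBatchWriteSupport(
      String queryId,              // unique per writing query
      StructType schema,           // schema of the data to be written
      SaveMode mode,
      DataSourceOptions options) { // assumed: the case-insensitive options map
    // Per the javadoc: return an empty Optional when the save mode means no write
    // is needed, e.g. SaveMode.Ignore against a target that already contains data.
    if (mode == SaveMode.Ignore && targetExists(options)) {
      return Optional.empty();
    }
    // A real source would return its BatchWriteSupport implementation here, e.g.
    // Optional.of(new ExampleBatchWriteSupport(queryId, schema, mode, options)).
    throw new UnsupportedOperationException("write support omitted from this sketch");
  }

  private boolean targetExists(DataSourceOptions options) {
    // Placeholder; a real source would check its underlying storage.
    return false;
  }
}
```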
Contributor (Author)

@rdblue I'd like to keep the save mode before we finish the new DDL logical plans and DataFrameWriter APIs. We are migrating the file source and we still need to support the save mode. After we finish the new DDL logical plans and DataFrameWriter APIs, we can rename it.
Contributor

I don't think this is a good idea. Why introduce a legacy API into a new API? If we are moving old sources to the new API, then they should fully implement the new API and should not continue to expose the unpredictable v1 behavior. That said, as long as the
Contributor (Author)

The problem here is, if we don't take the save mode, we can't migrate the file source to the new API yet. I hope we can get this in before Spark 2.4, so that some data source projects can start migrating and experimenting.
Contributor

@cloud-fan: this is a bad idea because it enables save modes other than append for DataSourceV2 without using the new logical plans. This leads to undefined behavior and is why we proposed standard logical plans in the first place. Using a data source implementation directly should only support appending and scanning; anything more complex must require the new logical plans.

I'm -1 on this.
Contributor

To clarify, your proposal is that we should block the completion of DataSourceV2 until the new logical plans are in place?
Contributor

I think that the v2 sources should only use the plans proposed in the SPIP. I also think that the v2 data source API should always tell the data source exactly what to do: for overwrite, what should be deleted and what should be added. That doesn't block fixing the v2 API here and doesn't prevent anyone from using it. But it would prevent people from relying on undefined behavior that results from passing an ambiguous SaveMode.

The only thing that would not be available by doing this is overwrite support by using the SaveMode, which isn't something anyone should rely on because it doesn't have defined behavior.

I understand that this may seem like it would block migration from the v1 API to the v2 API. But I think it is the right thing to do so that we have a clear and consistent definition for how v2 sources behave.
Contributor (Author)

I totally agree that
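For reference, the user-facing path this thread is debating is the direct-use write API mentioned in the javadoc above; the SaveMode chosen there is handed straight to the source. A minimal sketch, assuming an existing `Dataset<Row> df` and a hypothetical source name:

```java
// Direct use of a source by name: no catalog and no new logical plans involved.
// Anything other than Append here is the "ambiguous SaveMode" case discussed above,
// because what Overwrite deletes or replaces is left to each implementation.
df.write()
    .format("com.example.datasource")   // hypothetical source class/name
    .option("path", "/tmp/example")     // placeholder option
    .mode(SaveMode.Overwrite)
    .save();
```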
ContinuousReadSupport.java

@@ -24,16 +24,17 @@
 import org.apache.spark.sql.sources.v2.reader.ScanConfigBuilder;
 
 /**
- * An interface that defines how to scan the data from data source for continuous streaming
+ * An interface that defines how to load the data from data source for continuous streaming
  * processing.
  *
- * The execution engine will create an instance of this interface at the start of a streaming query,
- * then call {@link #newScanConfigBuilder(Offset)} and create an instance of {@link ScanConfig} for
- * the duration of the streaming query or until {@link #needsReconfiguration(ScanConfig)} is true.
- * The {@link ScanConfig} will be used to create input partitions and reader factory to process data
- * for its duration. At the end {@link #stop()} will be called when the streaming execution is
- * completed. Note that a single query may have multiple executions due to restart or failure
- * recovery.
+ * The execution engine will get an instance of this interface from a data source provider
+ * (e.g. {@link org.apache.spark.sql.sources.v2.ContinuousReadSupportProvider}) at the start of a
+ * streaming query, then call {@link #newScanConfigBuilder(Offset)} to create an instance of
+ * {@link ScanConfig} for the duration of the streaming query or until
+ * {@link #needsReconfiguration(ScanConfig)} is true. The {@link ScanConfig} will be used to create
+ * input partitions and reader factory to scan data for its duration. At the end {@link #stop()}
+ * will be called when the streaming execution is completed. Note that a single query may have
+ * multiple executions due to restart or failure recovery.
  */
 @InterfaceStability.Evolving
 public interface ContinuousReadSupport extends StreamingReadSupport, BaseStreamingSource {
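To make the lifecycle in the updated javadoc concrete, here is a rough, hypothetical sketch from the engine's point of view. Only `newScanConfigBuilder`, `needsReconfiguration`, and `stop` come from the interface shown above; `ScanConfigBuilder.build()`, the loop condition, the offset helper, and the import locations are assumptions for illustration, not Spark's actual execution code.

```java
import org.apache.spark.sql.sources.v2.reader.ScanConfig;
import org.apache.spark.sql.sources.v2.reader.streaming.ContinuousReadSupport;
import org.apache.spark.sql.sources.v2.reader.streaming.Offset;

// Hypothetical driver loop, not from this PR.
final class ContinuousLifecycleSketch {

  static void run(ContinuousReadSupport readSupport, Offset startOffset) {
    // One ScanConfig is used to create input partitions and the reader factory,
    // and stays in effect until reconfiguration is required.
    ScanConfig config = readSupport.newScanConfigBuilder(startOffset).build();  // build() assumed

    while (!stopRequested()) {
      if (readSupport.needsReconfiguration(config)) {
        // e.g. the source's partitioning changed; build a fresh ScanConfig.
        config = readSupport.newScanConfigBuilder(latestCommittedOffset(readSupport)).build();
      }
      // ... scan data for this configuration ...
    }

    // Called once this streaming execution completes; a restarted query would get
    // a new ContinuousReadSupport instance from the provider.
    readSupport.stop();
  }

  private static boolean stopRequested() { return true; }  // placeholder

  private static Offset latestCommittedOffset(ContinuousReadSupport rs) {
    return null;  // placeholder; a real engine tracks commit progress
  }
}
```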
I think this interface (and the continuous and micro-batch equivalents) should note that returning a ReadSupport from options is for sources with no catalog support, or to use an implementation directly. Maybe we should add this after #21306 is in, though. What do you think?

Good point. I'll add some documentation about when this interface will be used (spark.read.format...), which means using an implementation directly.
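For completeness, "using an implementation directly" here means passing the class name (or a registered short name) straight to the reader, with no catalog in between. A minimal sketch, assuming an existing SparkSession named `spark` and a hypothetical source class:

```java
// Direct use on the read side: Spark resolves the class from the format string,
// then asks it for read support via the provider interfaces discussed above.
Dataset<Row> df = spark.read()
    .format("com.example.datasource.ExampleSource")  // hypothetical implementation
    .option("path", "/tmp/example")                  // placeholder option
    .load();
df.show();
```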