[SPARK-29908][SQL] Alternative proposal for supporting partitioning through save for V2 tables by brkyvz · Pull Request #25833 · apache/spark

brkyvz · 2019-09-18T16:42:38Z

What changes were proposed in this pull request?

This is an alternative proposal to #25822 and #25651. The problem we're trying to solve is that when a data source is used without a catalog, there is no good way to create a V2 table with partitioning and table property information. Spark users have long connected to data sources such as Kafka and JDBC tables through data source options, and it should remain possible to create tables that way.

This PR introduces two interfaces: SupportsCreateTable and SupportsIdentifierTranslation. SupportsCreateTable contains the parts of TableCatalog that relate to creating and dropping tables. These are pulled out, and TableCatalog now extends this interface. SupportsIdentifierTranslation is a way for data sources to go from data source options to an internal identifier that describes how to access the table. A TableProvider can extend SupportsIdentifierTranslation and SupportsCreateTable to support creating tables without requiring an explicit catalog.
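To make the shape of this proposal easier to picture, here is a minimal, self-contained Java sketch. It is not the code in this PR: the method names (`tableExists`, `createTable`, `dropTable`, `fromOptions`), the simplified `Ident` type, and the `InMemoryProvider` class are assumptions made for illustration, and none of Spark's real classes are used. The interface names follow the spelling that appears in the diff (`SupportCreateTable`, `SupportIdentifierTranslation`).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a table identifier: a namespace plus a name.
final class Ident {
    final List<String> namespace;
    final String name;

    Ident(List<String> namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Ident)) return false;
        Ident other = (Ident) o;
        return namespace.equals(other.namespace) && name.equals(other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(namespace, name);
    }
}

// Assumed shape: the create/drop pieces pulled out of a table catalog.
interface SupportCreateTable {
    boolean tableExists(Ident ident);
    void createTable(Ident ident, Map<String, String> properties);
    boolean dropTable(Ident ident);
}

// Assumed shape: turn data source options into an identifier.
interface SupportIdentifierTranslation {
    Ident fromOptions(Map<String, String> options);
}

// A provider that implements both, so tables can be created without a catalog.
final class InMemoryProvider implements SupportCreateTable, SupportIdentifierTranslation {
    private final Map<Ident, Map<String, String>> tables = new HashMap<>();

    @Override
    public Ident fromOptions(Map<String, String> options) {
        String path = options.get("path");
        if (path == null) {
            throw new IllegalArgumentException("Didn't specify the 'path' option");
        }
        // Path-based tables get an empty namespace in this sketch.
        return new Ident(Collections.<String>emptyList(), path);
    }

    @Override
    public boolean tableExists(Ident ident) {
        return tables.containsKey(ident);
    }

    @Override
    public void createTable(Ident ident, Map<String, String> properties) {
        if (tables.containsKey(ident)) {
            throw new IllegalStateException("Table already exists: " + ident.name);
        }
        tables.put(ident, new HashMap<>(properties));
    }

    @Override
    public boolean dropTable(Ident ident) {
        // Returns true only if a table was actually removed.
        return tables.remove(ident) != null;
    }
}
```

In this sketch the caller first translates options into an identifier and then uses that identifier for existence checks and creation, which is the flow the description above implies.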

This would:

Fix the behavior for DataFrameWriter.save when passing in partitioning information to data sources
Allow ErrorIfExists and Ignore to be supported for DataFrameWriter.save
Open the path for supporting path based tables in DataFrameWriterV2
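
As a rough illustration of the second point, the sketch below shows how a save call could honor each mode once the source can answer "does this table exist?" and create it on demand. It is a self-contained Java toy, not Spark code: the `Mode` enum, the `SaveSketch` class, and the returned action strings are hypothetical, and the create-on-append behavior is an assumption modeled on the `orCreate = true` flag in this PR's diff.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical mirror of the four save modes.
enum Mode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

final class SaveSketch {
    // Stand-in for whatever tracks existing tables in the data source.
    private final Set<String> tables = new HashSet<>();

    // Returns a label for the action taken, for illustration only.
    String save(String ident, Mode mode) {
        boolean exists = tables.contains(ident);
        switch (mode) {
            case ERROR_IF_EXISTS:
                if (exists) {
                    throw new IllegalStateException("Table already exists: " + ident);
                }
                tables.add(ident);
                return "create";
            case IGNORE:
                if (exists) {
                    return "skip"; // leave existing data untouched
                }
                tables.add(ident);
                return "create";
            case APPEND:
                if (exists) {
                    return "append";
                }
                tables.add(ident); // assumed: create the table if it doesn't exist
                return "create";
            case OVERWRITE:
                tables.add(ident);
                return exists ? "overwrite" : "create";
            default:
                throw new AssertionError("Unknown mode: " + mode);
        }
    }
}
```

Without an existence check and a create path on the source side, the `ERROR_IF_EXISTS` and `IGNORE` branches have nothing to dispatch on, which is the gap the list above refers to.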

Why are the changes needed?

DataFrameWriter.save is broken for data sources that migrate from the DataSource V1 to the V2 APIs and want to receive partitioning information and support the different SaveModes.

Does this PR introduce any user-facing change?

A data source that was DataSource V1 in Spark 2.4 can behave identically as DataSource V2 in Spark 3.0.

How was this patch tested?

Will add tests after comments

SparkQA · 2019-09-18T16:56:59Z

Test build #110927 has finished for PR 25833 at commit e48a5c4.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait FileDataSourceV2 extends SupportIdentifierTranslation with DataSourceRegister

dongjoon-hyun · 2019-09-18T17:49:22Z

Hi, @brkyvz. This seems to break the compilation. Could you take a look?

brkyvz · 2019-09-18T19:00:49Z

cc @jose-torres @cloud-fan @dbtsai @rdblue

brkyvz · 2019-09-18T19:15:14Z

Another option is that a V2 DataSource doesn't need to extend TableProvider for CreateTable and the like to go through the V2SessionCatalog, and a DataSource can continue to re-use its V1 APIs.

rdblue · 2019-09-18T22:45:11Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala

+                extraOptions.toMap,
+                orCreate = true)      // Create the table if it doesn't exist
+
+            case (other, _) =>


Why not use AppendData when mode is append?

rdblue · 2019-09-18T22:45:37Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java

 */
 @Experimental
-public interface TableCatalog extends CatalogPlugin {
+public interface TableCatalog extends CatalogPlugin, SupportCreateTable {


Why do tables managed by a TableProvider not require invalidateTable?

rdblue · 2019-09-18T22:47:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala

+    } else if (paths.isEmpty) {
+      throw new IllegalArgumentException("Didn't specify the 'path' for file based table")
+    }
+    Identifier.of(Array.empty, paths.head)


This should be a different class, PathIdentifier, so that we can easily identify these and handle them separately.
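
One way to picture this suggestion is the self-contained Java sketch below. It is only an illustration under assumptions: `SimpleIdentifier`, `NamedIdentifier`, and `IdentifierRouting` are hypothetical names, the `PathIdentifier` here is a guess at what such a class could look like, and none of it is taken from Spark's code.

```java
// Hypothetical stand-in for an identifier interface: namespace plus name.
interface SimpleIdentifier {
    String[] namespace();
    String name();
}

// Ordinary catalog-style identifier.
final class NamedIdentifier implements SimpleIdentifier {
    private final String[] namespace;
    private final String name;

    NamedIdentifier(String[] namespace, String name) {
        this.namespace = namespace.clone();
        this.name = name;
    }

    @Override
    public String[] namespace() { return namespace.clone(); }

    @Override
    public String name() { return name; }
}

// Dedicated type for path-based tables, so callers can tell them apart.
final class PathIdentifier implements SimpleIdentifier {
    private final String path;

    PathIdentifier(String path) {
        if (path == null || path.isEmpty()) {
            throw new IllegalArgumentException("Didn't specify the 'path' for file based table");
        }
        this.path = path;
    }

    String path() { return path; }

    @Override
    public String[] namespace() { return new String[0]; }

    @Override
    public String name() { return path; }
}

final class IdentifierRouting {
    // Path-based identifiers are recognized by type, not by an empty namespace.
    static String describe(SimpleIdentifier ident) {
        if (ident instanceof PathIdentifier) {
            return "path-based table at " + ((PathIdentifier) ident).path();
        }
        StringBuilder qualified = new StringBuilder();
        for (String part : ident.namespace()) {
            qualified.append(part).append('.');
        }
        qualified.append(ident.name());
        return "catalog table " + qualified;
    }
}
```

With a distinct type, an identifier with an empty namespace that happens to be a plain table name is not mistaken for a path.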

let's see

e48a5c4

brkyvz mentioned this pull request Sep 18, 2019

[SPARK-29908][SQL] Support partitioning and bucketing through DataFrameWriter.save for V2 Tables #25822

Closed

rdblue reviewed Sep 18, 2019

dongjoon-hyun added the SQL label Sep 19, 2019

brkyvz closed this Nov 11, 2019

dongjoon-hyun changed the title ~~[SPARK-29127][SQL] Alternative proposal for supporting partitioning through save for V2 tables~~ [SPARK-29908][SQL] Alternative proposal for supporting partitioning through save for V2 tables Nov 16, 2019

brkyvz wants to merge 1 commit into apache:master from brkyvz:radicalV2


4 participants
