Conversation

@cloud-fan (Contributor) commented Nov 23, 2016

What changes were proposed in this pull request?

The CreateDataSourceTableAsSelectCommand is quite complex now, as it has a lot of work to do if the table already exists:

  1. throw an exception if we don't want to ignore the existing table.
  2. do some checks and adjust the schema if we want to append data.
  3. drop the table and create it again if we want to overwrite.

Steps 2 and 3 should be done by the analyzer, so that we can also apply them to Hive tables (see the sketch below).
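For context, here is a minimal sketch of the three existing-table behaviors listed above. The helper functions are hypothetical stand-ins for the real catalog operations, not Spark's actual code:

```scala
import org.apache.spark.sql.SaveMode

object ExistingTableHandling {
  // Hypothetical placeholders for the real catalog operations.
  private def checkSchemaAndAppend(table: String): Unit = println(s"append to $table")
  private def dropAndRecreate(table: String): Unit = println(s"recreate $table")

  def handleExistingTable(mode: SaveMode, table: String): Unit = mode match {
    case SaveMode.ErrorIfExists =>
      // 1. throw if we don't want to ignore the existing table
      // (Spark raises an AnalysisException at this point)
      throw new IllegalStateException(s"Table $table already exists.")
    case SaveMode.Ignore =>
      () // the table exists and we were asked to ignore it: no-op
    case SaveMode.Append =>
      // 2. check compatibility and adjust the schema, then append the data
      checkSchemaAndAppend(table)
    case SaveMode.Overwrite =>
      // 3. drop the table and create it again
      dropAndRecreate(table)
  }
}
```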

How was this patch tested?

Existing tests.

@cloud-fan (Contributor, Author)

cc @yhuai @gatorsmile

The first commit is from another PR and you can ignore it. Do you think we should target this ticket to 2.1? It's kind of a refactor, but it does fix some problems.

@SparkQA commented Nov 23, 2016

Test build #69084 has finished for PR 15996 at commit 7f90a10.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

This is a good change, I like it! Now the interface is identical to CreateHiveTableAsSelectCommand's. Maybe we can copy the params here.

Member

Is this TODO still valid?

@gatorsmile (Member)

A general issue: after we move the execution of Append and Overwrite into DataFrameWriter, the verification in AnalyzeCreateTable is no longer called. Some of that logic is still required.

@cloud-fan changed the title from [SPARK-18567][SQL][WIP] Simplify CreateDataSourceTableAsSelectCommand to [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand on Dec 12, 2016
@cloud-fan (Contributor, Author)

@gatorsmile, can you point out which verification from AnalyzeCreateTable we need to add back?

@SparkQA commented Dec 12, 2016

Test build #70026 has finished for PR 15996 at commit 89f148b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 13, 2016

Test build #70061 has started for PR 15996 at commit 172f6eb.

@cloud-fan (Contributor, Author)

retest this please

@SparkQA commented Dec 13, 2016

Test build #70074 has finished for PR 15996 at commit 172f6eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

throw new AnalysisException(
  s"The column number of the existing schema[$existingSchema] " +
    s"doesn't match the data schema[${df.logicalPlan.schema}]")
}
Member

Uh, this fixes a bug: before this PR, we only checked the size when the target was a LogicalRelation.

// Because we are inserting into an existing table, we should respect the existing
// schema and adjust the column order of the given DataFrame according to it.
df.select(existingSchema.map(f => Column(f.name)): _*)
  .write.insertInto(tableIdentWithDB)
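For context, insertInto resolves columns by position rather than by name, which is why the select above reorders the DataFrame to the table's column order first. A minimal illustration; the session setup and the table name target_table are assumptions for the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A DataFrame whose column order (b, a) differs from the table's (a, b).
val df = spark.range(3).selectExpr("id AS b", "id * 10 AS a")

// insertInto is positional: without reordering, b's values would land in
// column a and vice versa. Selecting in the table's order prevents that.
df.select("a", "b").write.insertInto("target_table") // hypothetical existing table
```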
Contributor Author

I thought it was OK to analyze a plan twice, but not to analyze an optimized plan; let me look into it.

Member

Sorry, I made a mistake here; I deleted the comment after I realized it.

assertNotBucketed("insertInto") is missing here. This is an existing bug, right?
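For context, a minimal sketch of the kind of guard assertNotBucketed performs; the standalone signature here is illustrative (in DataFrameWriter the bucketing state lives in fields, and the error raised is an AnalysisException):

```scala
// Illustrative only: numBuckets / sortColumnNames mimic DataFrameWriter's
// bucketing options, which insertInto does not support.
def assertNotBucketed(
    operation: String,
    numBuckets: Option[Int],
    sortColumnNames: Option[Seq[String]]): Unit = {
  if (numBuckets.isDefined || sortColumnNames.isDefined) {
    throw new IllegalArgumentException(
      s"'$operation' does not support bucketing right now")
  }
}
```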

@cloud-fan force-pushed the append branch 2 times, most recently from 1efb892 to 6c64007 on December 15, 2016 12:01

Contributor Author

These two checks are newly added. Previously we silently ignored the user-specified partition columns and bucketing; now we log a warning message.
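A minimal sketch of the kind of warning these checks emit; the object, method, and argument names are hypothetical, not the actual Spark code:

```scala
import org.slf4j.LoggerFactory

object PartitioningCheck {
  private val log = LoggerFactory.getLogger(getClass)

  // Warn (instead of failing or staying silent) when the user-specified
  // partitioning disagrees with the existing table's partitioning.
  def warnIfIgnored(userCols: Seq[String], tableCols: Seq[String]): Unit = {
    if (userCols.nonEmpty && userCols != tableCols) {
      log.warn(s"Specified partitioning (${userCols.mkString(", ")}) is ignored " +
        s"when appending to an existing table; using the table's partitioning " +
        s"(${tableCols.mkString(", ")}) instead.")
    }
  }
}
```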

Contributor Author

I reverted #15983 here because it's not needed anymore after this refactor.

Contributor

Which part of that PR is reverted?

Contributor Author

All of it, except the test.

@SparkQA commented Dec 15, 2016

Test build #70187 has finished for PR 15996 at commit 1efb892.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 15, 2016

Test build #70188 has finished for PR 15996 at commit 6c64007.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 17, 2016

Test build #70305 has finished for PR 15996 at commit 4178112.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

Could we update the PR description and add the test case in PartitionProviderCompatibilitySuite.scala to reflect the external behavior changes of CTAS on partitioned data source tables?

@cloud-fan force-pushed the append branch 2 times, most recently from 28f88ef to 97dc307 on December 20, 2016 14:19
Contributor

Before this change, we always went to createRelation, right?

Contributor Author

Currently only two data sources accept saveAsTable with append mode: CreatableRelationProvider and FileFormat. For CreatableRelationProvider we always go to createRelation; for FileFormat we go to InsertIntoHadoopFsRelation, which is the same as InsertIntoTable. That's why I added the if-else here (sketched below).
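A minimal sketch of that dispatch; the helper functions are hypothetical, and only the two provider types come from the comment above:

```scala
import org.apache.spark.sql.execution.datasources.FileFormat
import org.apache.spark.sql.sources.CreatableRelationProvider

// Hypothetical stand-ins for the two write paths described above.
def writeViaCreateRelation(): Unit = println("CreatableRelationProvider.createRelation")
def writeViaInsert(): Unit = println("InsertIntoHadoopFsRelation")

def dispatchAppend(provider: Any): Unit = provider match {
  case _: CreatableRelationProvider => writeViaCreateRelation()
  case _: FileFormat                => writeViaInsert()
  case other =>
    throw new IllegalArgumentException(
      s"$other does not support saveAsTable with append mode")
}
```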

Contributor

Can you explain why we always use SaveMode.Overwrite here?

Contributor Author

We are creating a new table and the data dir is empty, so ideally we could use any mode. Maybe ErrorIfExists is safer?

Contributor

Why do we not need this anymore?

Contributor

Oh, we are checking the number of rows before the MSCK, right?

Contributor Author

yep

Contributor

Let's also explain why we only see newly written partitions.

Contributor Author

To be consistent with the behavior of InsertIntoTable. I'll add that.

Contributor

It would be good to also explain in the comment why we use (3, 13).

@SparkQA commented Dec 23, 2016

Test build #70532 has finished for PR 15996 at commit 9a1ad71.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

One last comment: let's explicitly say that we want to test the case where a data source is a CreatableRelationProvider but its relation does not implement InsertableRelation.
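For reference, a hedged sketch of such a test source: a CreatableRelationProvider whose returned relation deliberately does not mix in InsertableRelation, so appends must go through createRelation. The class name is made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

class NotInsertableSource extends CreatableRelationProvider {
  override def createRelation(
      ctx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // A bare BaseRelation: it is not an InsertableRelation, so Spark cannot
    // plan an insert into it and must call createRelation for appends too.
    new BaseRelation {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType = data.schema
    }
  }
}
```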

@yhuai (Contributor) commented Dec 23, 2016

LGTM pending Jenkins. Can you update the comment to address my last comment (#15996 (comment))?

@yhuai (Contributor) commented Dec 23, 2016

Ah, 9a1ad71 failed.

@SparkQA commented Dec 27, 2016

Test build #70633 has finished for PR 15996 at commit be9e7b5.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 27, 2016

Test build #70634 has finished for PR 15996 at commit 3a610ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 27, 2016

Test build #70637 has finished for PR 15996 at commit 3d14939.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 28, 2016

Test build #70654 has finished for PR 15996 at commit 7f8d8c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tableLocation: Option[String],
data: LogicalPlan,
mode: SaveMode): BaseRelation = {
// Create the relation based on the data of df.
Member

Nit: the comment needs an update.

EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {
  // Only do the check if the table is a data source table (the relation is a BaseRelation).
  case LogicalRelation(dest: BaseRelation, _, _) =>
    if (srcRelations.contains(dest)) {
@gatorsmile (Member) Dec 28, 2016

Nit:

          case LogicalRelation(dest: BaseRelation, _, _) if srcRelations.contains(dest) =>
            throw new AnalysisException(...

case SaveMode.Append =>
val existingTable = sessionState.catalog.getTableMetadata(tableIdentWithDB)
val result = if (sessionState.catalog.tableExists(tableIdentWithDB)) {
assert(mode != SaveMode.Overwrite, "analyzer will drop the table to overwrite it.")
Member

How about s"Expect the table $tableName has been dropped when the save mode is Overwrite"?

@gatorsmile (Member)

LGTM except for three minor comments.

@SparkQA commented Dec 28, 2016

Test build #70672 has finished for PR 15996 at commit d8f31f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

@yhuai (Contributor) commented Dec 29, 2016

LGTM. Merging to master.

@asfgit closed this in 7d19b6a on Dec 29, 2016
cmonkey pushed a commit to cmonkey/spark that referenced this pull request on Dec 30, 2016
Author: Wenchen Fan <[email protected]>

Closes apache#15996 from cloud-fan/append.
uzadude pushed a commit to uzadude/spark that referenced this pull request on Jan 27, 2017