Conversation

@cloud-fan (Contributor) commented Nov 23, 2016

What changes were proposed in this pull request?

The CreateDataSourceTableAsSelectCommand is quite complex now, as it has a lot of work to do if the table already exists:

  1. throw an exception if we don't want to ignore the existing table.
  2. do some checks and adjust the schema if we want to append data.
  3. drop the table and create it again if we want to overwrite.

Steps 2 and 3 should be done by the analyzer, so that we can also apply them to Hive tables (see the sketch below).
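For context, here is a minimal sketch of the three existing-table behaviors listed above. The helper functions are hypothetical stand-ins for the real catalog operations, not Spark's actual code:

```scala
import org.apache.spark.sql.SaveMode

object ExistingTableHandling {
  // Hypothetical placeholders for the real catalog operations.
  private def checkSchemaAndAppend(table: String): Unit = println(s"append to $table")
  private def dropAndRecreate(table: String): Unit = println(s"recreate $table")

  def handleExistingTable(mode: SaveMode, table: String): Unit = mode match {
    case SaveMode.ErrorIfExists =>
      // 1. throw if we don't want to ignore the existing table
      // (Spark raises an AnalysisException at this point)
      throw new IllegalStateException(s"Table $table already exists.")
    case SaveMode.Ignore =>
      () // the table exists and we were asked to ignore it: no-op
    case SaveMode.Append =>
      // 2. check compatibility and adjust the schema, then append the data
      checkSchemaAndAppend(table)
    case SaveMode.Overwrite =>
      // 3. drop the table and create it again
      dropAndRecreate(table)
  }
}
```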

How was this patch tested?

Existing tests.

@cloud-fan (Contributor, Author)

cc @yhuai @gatorsmile

The first commit is from another PR and you can ignore it. Do you think we should target this ticket to 2.1? It's kind of a refactor, but it does fix some problems.

@SparkQA commented Nov 23, 2016

Test build #69084 has finished for PR 15996 at commit 7f90a10.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

This is a good change, I like it! Now the interface is identical to CreateHiveTableAsSelectCommand's. Maybe we can copy the params here.

Member

Is this TODO still valid?

@gatorsmile (Member)

A general issue: after we move the execution of Append and Overwrite into DataFrameWriter, the verification in AnalyzeCreateTable is no longer called. Some of that logic is still required.

@cloud-fan changed the title from [SPARK-18567][SQL][WIP] Simplify CreateDataSourceTableAsSelectCommand to [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand on Dec 12, 2016
@cloud-fan (Contributor, Author)

@gatorsmile, can you point out which verification from AnalyzeCreateTable we need to add back?

@SparkQA commented Dec 12, 2016

Test build #70026 has finished for PR 15996 at commit 89f148b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 13, 2016

Test build #70061 has started for PR 15996 at commit 172f6eb.

@cloud-fan (Contributor, Author)

retest this please

@SparkQA commented Dec 13, 2016

Test build #70074 has finished for PR 15996 at commit 172f6eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

throw new AnalysisException(
  s"The column number of the existing schema[$existingSchema] " +
    s"doesn't match the data schema[${df.logicalPlan.schema}]")
}
Member

Uh, this fixes a bug: before this PR, we only checked the size when the target was a LogicalRelation.

// Because we are inserting into an existing table, we should respect the existing
// schema and adjust the column order of the given DataFrame according to it.
df.select(existingSchema.map(f => Column(f.name)): _*)
  .write.insertInto(tableIdentWithDB)
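For context, insertInto resolves columns by position rather than by name, which is why the select above reorders the DataFrame to the table's column order first. A minimal illustration; the session setup and the table name target_table are assumptions for the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A DataFrame whose column order (b, a) differs from the table's (a, b).
val df = spark.range(3).selectExpr("id AS b", "id * 10 AS a")

// insertInto is positional: without reordering, b's values would land in
// column a and vice versa. Selecting in the table's order prevents that.
df.select("a", "b").write.insertInto("target_table") // hypothetical existing table
```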
Contributor Author

I thought it was OK to analyze a plan twice, but not to analyze an optimized plan; let me look into it.

Member

Sorry, I made a mistake here; I deleted the comment after I realized it.

assertNotBucketed("insertInto") is missing here. This is an existing bug, right?
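For context, a minimal sketch of the kind of guard assertNotBucketed performs; the standalone signature here is illustrative (in DataFrameWriter the bucketing state lives in fields, and the error raised is an AnalysisException):

```scala
// Illustrative only: numBuckets / sortColumnNames mimic DataFrameWriter's
// bucketing options, which insertInto does not support.
def assertNotBucketed(
    operation: String,
    numBuckets: Option[Int],
    sortColumnNames: Option[Seq[String]]): Unit = {
  if (numBuckets.isDefined || sortColumnNames.isDefined) {
    throw new IllegalArgumentException(
      s"'$operation' does not support bucketing right now")
  }
}
```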

@cloud-fan force-pushed the append branch 2 times, most recently from 1efb892 to 6c64007 on December 15, 2016 12:01

Contributor Author

These two checks are newly added. Previously we silently ignored the user-specified partition columns and bucketing; now we log a warning message.
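A minimal sketch of the kind of warning these checks emit; the object, method, and argument names are hypothetical, not the actual Spark code:

```scala
import org.slf4j.LoggerFactory

object PartitioningCheck {
  private val log = LoggerFactory.getLogger(getClass)

  // Warn (instead of failing or staying silent) when the user-specified
  // partitioning disagrees with the existing table's partitioning.
  def warnIfIgnored(userCols: Seq[String], tableCols: Seq[String]): Unit = {
    if (userCols.nonEmpty && userCols != tableCols) {
      log.warn(s"Specified partitioning (${userCols.mkString(", ")}) is ignored " +
        s"when appending to an existing table; using the table's partitioning " +
        s"(${tableCols.mkString(", ")}) instead.")
    }
  }
}
```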

Contributor Author

I reverted #15983 here because it's not needed anymore after this refactor.

Contributor

Which part of that PR is reverted?

Contributor Author

All of it, except the test.

@SparkQA commented Dec 15, 2016

Test build #70187 has finished for PR 15996 at commit 1efb892.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 15, 2016

Test build #70188 has finished for PR 15996 at commit 6c64007.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 17, 2016

Test build #70305 has finished for PR 15996 at commit 4178112.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

Could we update the PR description and add the test case in PartitionProviderCompatibilitySuite.scala to reflect the external behavior changes of CTAS on partitioned data source tables?

@cloud-fan force-pushed the append branch 2 times, most recently from 28f88ef to 97dc307 on December 20, 2016 14:19
Contributor

Before this change, we always went to createRelation, right?

Contributor Author

Currently only two data sources accept saveAsTable with append mode: CreatableRelationProvider and FileFormat. For CreatableRelationProvider we always go to createRelation; for FileFormat we go to InsertIntoHadoopFsRelation, which is the same as InsertIntoTable. That's why I added the if-else here (sketched below).
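A minimal sketch of that dispatch; the helper functions are hypothetical, and only the two provider types come from the comment above:

```scala
import org.apache.spark.sql.execution.datasources.FileFormat
import org.apache.spark.sql.sources.CreatableRelationProvider

// Hypothetical stand-ins for the two write paths described above.
def writeViaCreateRelation(): Unit = println("CreatableRelationProvider.createRelation")
def writeViaInsert(): Unit = println("InsertIntoHadoopFsRelation")

def dispatchAppend(provider: Any): Unit = provider match {
  case _: CreatableRelationProvider => writeViaCreateRelation()
  case _: FileFormat                => writeViaInsert()
  case other =>
    throw new IllegalArgumentException(
      s"$other does not support saveAsTable with append mode")
}
```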

Contributor

Can you explain why we always use SaveMode.Overwrite here?

Contributor Author

We are creating a new table and the data dir is empty, so ideally we could use any mode. Maybe ErrorIfExists is safer?

Contributor

Why do we not need this anymore?

Contributor

Oh, we are checking the number of rows before the MSCK, right?

Contributor Author

yep

Contributor

Let's also explain why we only see newly written partitions.

Contributor Author

To be consistent with the behavior of InsertIntoTable. I'll add that.

Contributor

It would be good to also explain in the comment why we use (3, 13).

@SparkQA commented Dec 23, 2016

Test build #70532 has finished for PR 15996 at commit 9a1ad71.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

One last comment: let's explicitly say that we want to test the case where a data source is a CreatableRelationProvider but its relation does not implement InsertableRelation.
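For reference, a hedged sketch of such a test source: a CreatableRelationProvider whose returned relation deliberately does not mix in InsertableRelation, so appends must go through createRelation. The class name is made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

class NotInsertableSource extends CreatableRelationProvider {
  override def createRelation(
      ctx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // A bare BaseRelation: it is not an InsertableRelation, so Spark cannot
    // plan an insert into it and must call createRelation for appends too.
    new BaseRelation {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType = data.schema
    }
  }
}
```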

@yhuai (Contributor) commented Dec 23, 2016

LGTM pending Jenkins. Can you update the comment to address my last comment (#15996 (comment))?

@yhuai (Contributor) commented Dec 23, 2016

Ah, 9a1ad71 failed.

@SparkQA commented Dec 27, 2016

Test build #70633 has finished for PR 15996 at commit be9e7b5.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 27, 2016

Test build #70634 has finished for PR 15996 at commit 3a610ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 27, 2016

Test build #70637 has finished for PR 15996 at commit 3d14939.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 28, 2016

Test build #70654 has finished for PR 15996 at commit 7f8d8c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tableLocation: Option[String],
data: LogicalPlan,
mode: SaveMode): BaseRelation = {
// Create the relation based on the data of df.
Member

Nit: the comment needs an update.

EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {
  // Only do the check if the table is a data source table (the relation is a BaseRelation).
  case LogicalRelation(dest: BaseRelation, _, _) =>
    if (srcRelations.contains(dest)) {
@gatorsmile (Member) Dec 28, 2016

Nit:

          case LogicalRelation(dest: BaseRelation, _, _) if srcRelations.contains(dest) =>
            throw new AnalysisException(...

case SaveMode.Append =>
val existingTable = sessionState.catalog.getTableMetadata(tableIdentWithDB)
val result = if (sessionState.catalog.tableExists(tableIdentWithDB)) {
assert(mode != SaveMode.Overwrite, "analyzer will drop the table to overwrite it.")
Member

How about s"Expect the table $tableName has been dropped when the save mode is Overwrite"?

@gatorsmile (Member)

LGTM except for three minor comments.

@SparkQA commented Dec 28, 2016

Test build #70672 has finished for PR 15996 at commit d8f31f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

@yhuai (Contributor) commented Dec 29, 2016

LGTM. Merging to master.

@asfgit closed this in 7d19b6a on Dec 29, 2016
cmonkey pushed a commit to cmonkey/spark that referenced this pull request on Dec 30, 2016
Author: Wenchen Fan <[email protected]>

Closes apache#15996 from cloud-fan/append.
uzadude pushed a commit to uzadude/spark that referenced this pull request on Jan 27, 2017