[SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax by imback82 · Pull Request #27482 · apache/spark

imback82 · 2020-02-07T03:33:28Z

What changes were proposed in this pull request?

This PR proposes to support Hive-style ALTER TABLE ... REPLACE COLUMNS ... as described in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns

Users can now do the following:

CREATE TABLE t (col1 int, col2 int) USING Foo;
ALTER TABLE t REPLACE COLUMNS (col2 string COMMENT 'comment2', col3 int COMMENT 'comment3');

This drops the existing columns col1 and col2, and adds the new columns col2 and col3.

Why are the changes needed?

This is a new DDL statement. Spark currently supports the Hive-style ALTER TABLE ... CHANGE COLUMN ..., so this new addition can be useful.

Does this PR introduce any user-facing change?

Yes, adding a new DDL statement.

How was this patch tested?

More tests to be added.

imback82 · 2020-02-07T03:52:00Z

@cloud-fan This is WIP, but I have a couple of questions.

REPLACE COLUMNS needs to drop all the existing columns, so I am creating a Seq[TableChange] which has DeleteColumns followed by AddColumns.

1. Can we assume that TableCatalog.alterTable() applies the changes in the given order? (This is not documented.)
2. Since it needs to drop all the existing columns, we need to look up the table before creating the AlterTable logical plan. What I currently have calls loadTable in ResolveCatalogs, which may not be ideal since we will do two lookups (the other in ResolveTables). Another way is to register a callback on AlterTable that is called after the table is resolved. What do you think?
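To make the ordering question concrete, here is a minimal, self-contained sketch. It does not use Spark's real `TableChange` API: `Change`, `DeleteColumn`, `AddColumn`, `Column`, `replaceColumns`, and `applyChanges` are all hypothetical stand-ins invented for illustration. It assumes a catalog that applies changes strictly in the given order and rejects adding a column whose name already exists.

```scala
// Hypothetical, simplified stand-ins for Spark's TableChange hierarchy.
sealed trait Change
final case class DeleteColumn(name: String) extends Change
final case class AddColumn(name: String, dataType: String, comment: Option[String]) extends Change

final case class Column(name: String, dataType: String, comment: Option[String] = None)

// REPLACE COLUMNS modelled as "delete every existing column, then add the new ones".
def replaceColumns(existing: Seq[Column], replacement: Seq[Column]): Seq[Change] =
  existing.map(c => DeleteColumn(c.name)) ++
    replacement.map(c => AddColumn(c.name, c.dataType, c.comment))

// Applies the changes strictly in the given order and stops at the first failure.
def applyChanges(schema: Seq[Column], changes: Seq[Change]): Either[String, Seq[Column]] =
  changes.foldLeft[Either[String, Seq[Column]]](Right(schema)) {
    case (Right(cols), DeleteColumn(n)) =>
      if (cols.exists(_.name == n)) Right(cols.filterNot(_.name == n))
      else Left(s"Cannot delete missing column $n")
    case (Right(cols), AddColumn(n, t, c)) =>
      if (cols.exists(_.name == n)) Left(s"Column $n already exists")
      else Right(cols :+ Column(n, t, c))
    case (err, _) => err
  }

val existing = Seq(Column("col1", "int"), Column("col2", "int"))
val replacement = Seq(
  Column("col2", "string", Some("comment2")),
  Column("col3", "int", Some("comment3")))

val changes = replaceColumns(existing, replacement)
val inOrder = applyChanges(existing, changes)          // succeeds with the new schema
val reversed = applyChanges(existing, changes.reverse) // fails: col2 is added before it is dropped
```

In this model the same set of changes succeeds or fails depending only on the order of application, which is why an implementation that reorders changes could break REPLACE COLUMNS whenever a column name is reused.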

SparkQA · 2020-02-07T07:52:27Z

Test build #118009 has finished for PR 27482 at commit e9a71b5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AlterTableReplaceColumnsStatement(

cloud-fan · 2020-02-07T07:54:18Z

> apply the changes in the given order?

This is a good point. I think we should. Can you open a PR to improve the doc?

> which may not be ideal since we will do two lookups

I think it's OK. We can clean it up later, when we think about how to resolve commands in general.

imback82 · 2020-02-12T02:06:36Z

@cloud-fan this is now ready for review. Thanks!

imback82 · 2020-02-12T02:26:48Z

We currently have the following for ADD COLUMN:

    | ALTER TABLE multipartIdentifier
        ADD (COLUMN | COLUMNS)
        columns=qualifiedColTypeWithPositionList                       #addTableColumns
    | ALTER TABLE multipartIdentifier
        ADD (COLUMN | COLUMNS)
        '(' columns=qualifiedColTypeWithPositionList ')'               #addTableColumns

But it seems that only the following is in the SQL standard:

    | ALTER TABLE multipartIdentifier
        ADD COLUMN?
        column=qualifiedColTypeWithPosition

and the following is Hive style:

    | ALTER TABLE multipartIdentifier
        ADD COLUMNS
        '(' columns=qualifiedColTypeWithPositionList ')'

Should we fix this as well? (If so, we can also easily combine the Hive-style ADD and REPLACE grammar.)

SparkQA · 2020-02-12T08:05:02Z

Test build #118275 has finished for PR 27482 at commit 30785f5.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-02-12T08:41:07Z

The problem is that we can't remove a SQL syntax that works in prior releases. Maybe we have to bear with it here.

cloud-fan · 2020-02-12T10:14:38Z

sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala

+    withTable(t) {
+      sql(s"CREATE TABLE $t (col1 int, col2 int) USING $v2Format")
+      sql(s"ALTER TABLE $t REPLACE COLUMNS " +
+        "(col2 string COMMENT 'comment2', col3 int COMMENT 'comment3')")


One question: if col2 already has a comment but we don't specify a new comment in REPLACE COLUMNS, shall we retain the comment? What's the behavior of Hive?

The behavior of REPLACE COLUMNS is to drop all the existing columns first and then add the new columns. Thus, the comment will not be retained. I will update the test to reflect this.
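The difference between the two behaviors discussed here can be sketched as follows. This is an illustrative model only: `Col`, `dropThenAdd`, and `retainComments` are hypothetical names, and `retainComments` shows the alternative raised in the question, not what this PR implements.

```scala
final case class Col(name: String, dataType: String, comment: Option[String] = None)

// Semantics described in the reply: every existing column is dropped first,
// so the resulting schema is exactly the replacement list.
def dropThenAdd(existing: Seq[Col], replacement: Seq[Col]): Seq[Col] =
  replacement

// Hypothetical alternative (NOT what this PR does): keep an old comment when a
// same-named replacement column does not specify one.
def retainComments(existing: Seq[Col], replacement: Seq[Col]): Seq[Col] = {
  val oldComments = existing.flatMap(c => c.comment.map(text => c.name -> text)).toMap
  replacement.map(c => c.copy(comment = c.comment.orElse(oldComments.get(c.name))))
}

val existing = Seq(Col("col1", "int"), Col("col2", "int", Some("old comment")))
val replacement = Seq(Col("col2", "string"), Col("col3", "int", Some("comment3")))

val actual = dropThenAdd(existing, replacement)         // col2 ends up with no comment
val alternative = retainComments(existing, replacement) // col2 would keep "old comment"
```

Under drop-then-add, the old col2 and the new col2 are unrelated columns that happen to share a name, so nothing carries over from one to the other.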

cloud-fan · 2020-02-12T10:15:37Z

retest this please

SparkQA · 2020-02-12T14:52:31Z

Test build #118294 has finished for PR 27482 at commit 30785f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-12T21:50:08Z

Test build #118313 has finished for PR 27482 at commit 9cc277d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-02-13T12:13:52Z

LGTM, merging to master!

cloud-fan · 2020-02-18T05:12:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

              }
            }

+            val colsToDelete = mutable.Set.empty[Seq[String]]


This causes conflicts when I backport #27584.

I think the change in this file should go into 3.0 as well. Logically, deleted columns should be skipped when checking for name duplication in AddColumn.
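A rough sketch of that rule, assuming exact-match name comparison (the real analyzer resolves names according to its case-sensitivity setting). Only the `colsToDelete` name comes from the diff hunk above; `Change`, `DeleteColumn`, `AddColumn`, and `duplicateAdds` are hypothetical and do not reproduce the actual CheckAnalysis code.

```scala
import scala.collection.mutable

// Hypothetical, simplified change types; field names are multi-part paths.
sealed trait Change
final case class DeleteColumn(fieldNames: Seq[String]) extends Change
final case class AddColumn(fieldNames: Seq[String]) extends Change

// Returns the added columns that clash with a column that still exists,
// skipping columns deleted earlier in the same change list.
def duplicateAdds(existing: Set[Seq[String]], changes: Seq[Change]): Seq[Seq[String]] = {
  val colsToDelete = mutable.Set.empty[Seq[String]]
  val colsAdded = mutable.Set.empty[Seq[String]]
  val duplicates = mutable.ArrayBuffer.empty[Seq[String]]
  changes.foreach {
    case DeleteColumn(names) =>
      colsToDelete += names
    case AddColumn(names) =>
      val stillExists = existing.contains(names) && !colsToDelete.contains(names)
      if (stillExists || colsAdded.contains(names)) duplicates += names
      else colsAdded += names
  }
  duplicates.toSeq
}

val existing = Set(Seq("col1"), Seq("col2"))

// REPLACE COLUMNS style: col2 is deleted before being re-added, so no clash.
val withDeletes = duplicateAdds(existing, Seq(
  DeleteColumn(Seq("col1")), DeleteColumn(Seq("col2")),
  AddColumn(Seq("col2")), AddColumn(Seq("col3"))))

// Plain ADD COLUMNS: col2 still exists, so it is reported as a duplicate.
val withoutDeletes = duplicateAdds(existing, Seq(
  AddColumn(Seq("col2")), AddColumn(Seq("col3"))))
```

Without the `colsToDelete` bookkeeping, the REPLACE COLUMNS example from the PR description would be rejected because col2 appears in both the old and the new schema.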

@imback82 can you open a PR to backport #27584 with changes in this file?

Yes, working on it now!

… able to reference columns being added (Backport of #27584 + partial #27482)

### What changes were proposed in this pull request?

In ALTER TABLE, a column in ADD COLUMNS can depend on the position of a column that is just being added. For example, for a table with the following schema:

```
root:
  - a: string
  - b: long
```

the following should work:

```
ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x)
```

Currently, the above statement will throw an exception saying that AFTER x cannot be resolved, because x doesn't exist yet. This PR proposes to fix this issue.

### Why are the changes needed?

To fix a bug described above.

### Does this PR introduce any user-facing change?

Yes, now

```
ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x)
```

works as expected.

### How was this patch tested?

Added new tests

Closes #27624 from imback82/backport_27584.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Closes apache#27482 from imback82/replace_cols.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

initial commit

e9a71b5

dongjoon-hyun added the SQL label Feb 10, 2020

add more tests

30785f5

imback82 changed the title ~~[WIP][SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax~~ [SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax Feb 12, 2020

cloud-fan reviewed Feb 12, 2020

update test

9cc277d

cloud-fan closed this in a6b4b91 Feb 13, 2020

cloud-fan reviewed Feb 18, 2020

imback82 mentioned this pull request Feb 18, 2020

[SPARK-30814][SQL][3.0] ALTER TABLE ... ADD COLUMN position should be able to reference columns being added #27624

Closed
