Skip to content

Comments

[SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax#27482

Closed
imback82 wants to merge 3 commits intoapache:masterfrom
imback82:replace_cols
Closed

[SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax#27482
imback82 wants to merge 3 commits intoapache:masterfrom
imback82:replace_cols

Conversation

@imback82
Copy link
Contributor

@imback82 imback82 commented Feb 7, 2020

What changes were proposed in this pull request?

This PR proposes to support Hive-style ALTER TABLE ... REPLACE COLUMNS ... as described in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns

The user now can do the following:

CREATE TABLE t (col1 int, col2 int) USING Foo;
ALTER TABLE t REPLACE COLUMNS (col2 string COMMENT 'comment2', col3 int COMMENT 'comment3');

, which drops the existing columns col1 and col2, and add new columns col2 and col3.

Why are the changes needed?

This is a new DDL statement. Spark currently supports the Hive-style ALTER TABLE ... CHANGE COLUMN ..., so this new addition can be useful.

Does this PR introduce any user-facing change?

Yes, adding a new DDL statement.

How was this patch tested?

More tests to be added.

@imback82
Copy link
Contributor Author

imback82 commented Feb 7, 2020

@cloud-fan This is WIP, but I have a couple of questions.

REPLACE COLUMNS needs to drop all the existing columns, so I am creating Seq[TableChange] which has DeleteColumns followed by AddColumns.

  1. Can we assume that TableCatalog.alterTable() would apply the changes in the given order? (this is not documented).
  2. Since it needs to drop all the existing columns, we need to look up the table before creating AlterTable logical plan. What I currently have is to call loadTable in ResolveCatalogs, which may not be ideal since we will do two look ups (another in ResolveTables). Another way is to register a callback to AlterTable which can be called after table is resolved. What do you think?

@SparkQA
Copy link

SparkQA commented Feb 7, 2020

Test build #118009 has finished for PR 27482 at commit e9a71b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AlterTableReplaceColumnsStatement(

@cloud-fan
Copy link
Contributor

apply the changes in the given order?

This is a good point. I think we should, can you open a PR to improve the doc?

which may not be ideal since we will do two look ups

I think it's OK. We can clean it up later, thinking about how to resolve commands in general.

@imback82 imback82 changed the title [WIP][SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax [SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax Feb 12, 2020
@imback82
Copy link
Contributor Author

@cloud-fan this is now ready for review. Thanks!

@imback82
Copy link
Contributor Author

We currently have the following for ADD COLUMN

    | ALTER TABLE multipartIdentifier
        ADD (COLUMN | COLUMNS)
        columns=qualifiedColTypeWithPositionList                       #addTableColumns
    | ALTER TABLE multipartIdentifier
        ADD (COLUMN | COLUMNS)
        '(' columns=qualifiedColTypeWithPositionList ')'               #addTableColumns

But it seems that only the following is the sql standard:

    | ALTER TABLE multipartIdentifier
        ADD COLUMN?
        column=qualifiedColTypeWithPosition 

, and the following is Hive style:

    | ALTER TABLE multipartIdentifier
        ADD COLUMNS
        '(' columns=qualifiedColTypeWithPositionList ')' 

Should we fix this as well? (if so, we can combine hive style ADD and REPLACE grammar easily as well.)

@SparkQA
Copy link

SparkQA commented Feb 12, 2020

Test build #118275 has finished for PR 27482 at commit 30785f5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

The problem is that we can't remove a SQL syntax that works in prior releases. Maybe we have to bear with it here.

withTable(t) {
sql(s"CREATE TABLE $t (col1 int, col2 int) USING $v2Format")
sql(s"ALTER TABLE $t REPLACE COLUMNS " +
"(col2 string COMMENT 'comment2', col3 int COMMENT 'comment3')")
Copy link
Contributor

@cloud-fan cloud-fan Feb 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question: if the col2 already has comment but we don't specify new comment in REPLACE COLUMNS, shall we retain the comment? What's the behavior of Hive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of REPLACE COLUMNS is to drop all the existing columns first then add new columns. Thus, the comment will not be retained. I will update the test to reflect this.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Feb 12, 2020

Test build #118294 has finished for PR 27482 at commit 30785f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 12, 2020

Test build #118313 has finished for PR 27482 at commit 9cc277d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, merging to master!

@cloud-fan cloud-fan closed this in a6b4b91 Feb 13, 2020
}
}

val colsToDelete = mutable.Set.empty[Seq[String]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes conflicts when I backport #27584

I think the change in this file should go into 3.0 as well. Logically columns deleted should be skipped when checking name duplication for AddColumn.

@imback82 can you open a PR to backport #27584 with changes in this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, working on it now!

cloud-fan pushed a commit that referenced this pull request Feb 19, 2020
… able to reference columns being added

(Backport of #27584 + partial #27482)

### What changes were proposed in this pull request?

In ALTER TABLE, a column in ADD COLUMNS can depend on the position of a column that is just being added. For example, for a table with the following schema:
```
root:
  - a: string
  - b: long
```
, the following should work:
```
ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x)
```
Currently, the above statement will throw an exception saying that AFTER x cannot be resolved, because x doesn't exist yet. This PR proposes to fix this issue.

### Why are the changes needed?

To fix a bug described above.

### Does this PR introduce any user-facing change?

Yes, now
```
ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x)
```
works as expected.

### How was this patch tested?

Added new tests

Closes #27624 from imback82/backport_27584.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

This PR proposes to support Hive-style `ALTER TABLE ... REPLACE COLUMNS ...` as described in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns

The user now can do the following:
```SQL
CREATE TABLE t (col1 int, col2 int) USING Foo;
ALTER TABLE t REPLACE COLUMNS (col2 string COMMENT 'comment2', col3 int COMMENT 'comment3');
```
, which drops the existing columns `col1` and `col2`, and add new columns `col2` and `col3`.

### Why are the changes needed?

This is a new DDL statement. Spark currently supports the Hive-style `ALTER TABLE ... CHANGE COLUMN ...`, so this new addition can be useful.

### Does this PR introduce any user-facing change?

Yes, adding a new DDL statement.

### How was this patch tested?

More tests to be added.

Closes apache#27482 from imback82/replace_cols.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants