Conversation

@emanuelebardelli emanuelebardelli commented Jun 3, 2019

What changes were proposed in this pull request?

Add support for ALTER TABLE REPLACE COLUMNS

  • drop a column if it is not present in the new schema
  • add a column if it is not present in the table’s schema
  • keep a column if column.name and column.dataType match between the table and new schemas
  • raise an exception if trying to replace an existing column with a different data type (a usage sketch follows this list)
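
A quick sketch of the intended behaviour (table and column names are illustrative only, not taken from the test suite):

spark.sql("CREATE TABLE t (c1 STRING, c2 INT, c3 BOOLEAN) USING PARQUET")

# The new column list keeps c1 and c2 (same name and type), drops c3, and adds c4:
spark.sql("ALTER TABLE t REPLACE COLUMNS (c1 STRING, c2 INT, c4 STRING)")

# Replacing c2 with a different data type would raise an exception instead:
# spark.sql("ALTER TABLE t REPLACE COLUMNS (c1 STRING, c2 STRING)")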

This patch is inspired by and mostly based on the work done by @xwu0226 in this PR: #16626

PR highlights:

  • dropping columns is now allowed (b04f94a)
  • AlterTableAddColumnsCommand refactored to be more generic and support the implementation of AlterTableReplaceColumnsCommand (f8cbb96, 00f6376, da16d3e)
  • refactored how the supported datasource formats for the add and replace columns commands are validated (da16d3e)
  • implemented a SchemaUtils check to compare two schemas and raise an exception if columns with the same name have different data types (8b6da23). This is based on the work done by @maropu here: f3dea60, 647963a. (A rough sketch of this kind of check follows the list.)
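
A rough PySpark sketch of the kind of schema check described in the last bullet. This is illustrative only: it is not the Scala SchemaUtils helper added in the PR, and the function name and the case-insensitive matching are assumptions.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def check_replaced_column_types(table_schema, new_schema):
    # Raise if a column kept by name would change its data type (case-insensitive here).
    existing = {f.name.lower(): f.dataType for f in table_schema.fields}
    for f in new_schema.fields:
        old_type = existing.get(f.name.lower())
        if old_type is not None and old_type != f.dataType:
            raise ValueError("Cannot replace column '%s': %s != %s"
                             % (f.name, old_type.simpleString(), f.dataType.simpleString()))

# Replacing c2 INT with c2 STRING should be rejected:
old = StructType([StructField("c1", StringType()), StructField("c2", IntegerType())])
new = StructType([StructField("c1", StringType()), StructField("c2", StringType())])
# check_replaced_column_types(old, new)  # raises ValueError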

Known issues:

  • AlterTableReplaceColumnsCommand does not support CSV datasource format (13e3823)
  • If a column is dropped, then re-added later with a different data type, this would be allowed, but querying the table is then likely to fail, depending on the datasource format and the data types used (a small illustration follows this list).
    -- However, this is actually no different from dropping the whole table and re-creating it with a schema that doesn’t match the underlying data.
    -- In any case, if the underlying data doesn't match the table metadata, issues are likely. The Hive DDL documentation specifically warns about this possibility and leaves the responsibility for keeping data and metadata consistent to the user: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns
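
A hypothetical illustration of the second known issue (the exact error depends on the datasource format; Parquet is assumed here, and the table name is made up):

spark.sql("CREATE TABLE t2 (c1 STRING, c2 INT) USING PARQUET")
spark.sql("INSERT INTO t2 VALUES ('a', 1)")

# Drop c2, then re-add it with a different data type:
spark.sql("ALTER TABLE t2 REPLACE COLUMNS (c1 STRING)")
spark.sql("ALTER TABLE t2 REPLACE COLUMNS (c1 STRING, c2 STRING)")

# The files still store c2 as INT while the metadata now says STRING, so reading
# is likely to fail (e.g. a 'Parquet column cannot be converted' style error):
# spark.sql("SELECT * FROM t2").show()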

TODO:

  • agree on requirements and best approach
  • add tests for replacing comments
  • add tests for complex data types
  • add test for columns re-ordering

How was this patch tested?

  • added test cases
  • manually tested
  • ran the `testOnly org.apache.spark.sql.*` test suite; all tests pass locally

manu-olx added 10 commits June 4, 2019 15:01
- extract schema column requirement checks to method verifyColumnsToAddReplace
- class AlterTableAddColumnsCommand extends from base abstract class
- class AlterTableAddColumnsCommand uses verifyColumnsToAddReplace to check columns requirements
  - slight change in behaviour here: before, DDLUtils.checkDataColNames (i.e. checking the column name format) was done only on the columns to be added; now it's done on all columns, both the columns to be added and the columns already in the table
…s to abstract class

- refactor comments and exceptions to work for both add and replace operations
- remove 'replace columns' unsupported test
- fix 'add columns' unsupported operations tests to match refactored exceptions messages
- add orc as a supported format to the add columns test, as it is enabled in code
- refactor do not support views tests to work for both add and replace columns
- add tests for replace columns with duplicated columns
- add tests for replace columns with case sensitive sql conf
- alterTableDataSchema overwrites the table schema with the new schema provided, without checking whether the current schema columns are also available in the new schema
- align comments in other classes
- allow custom supported formats for add and replace commands, as add doesn't support text and replace doesn't support csv
- consolidate the way the checks are performed to the 'canonical name parsing' approach
- remove unused imports
@emanuelebardelli emanuelebardelli changed the title [SPARK-26388][SQL]_implement_hive_ddl_alter_table_replace_columns [SPARK-26388][SQL][WIP]_implement_hive_ddl_alter_table_replace_columns Jun 4, 2019
Member

I am fine with adding this syntax, but this only supports changing column comments, right?

Member

Dropping the column is still risky to us. When you add the columns back, we might see unexpected values.

Author

@gatorsmile the idea is actually to implement alter table replace columns fully:

  • drop a column if it is not present in the new schema
  • add a column if it is not present in the table’s schema
  • keep a column if the column name and column type match between schemas
  • raise an exception if trying to replace an existing column with a different data type

I believe this is a much needed feature for being able to manage metastores fully from Spark; however, I understand the complexity and risks of this operation, so please let’s take the time to make sure it’s the right thing to do and to consider all possible ramifications.

I tried to make the commit history as clean and as descriptive as possible, and I've added some more details in the known issues section of the PR description.

I've also added an extra validation step to prevent columns from being replaced if the data type doesn't match (8b6da23).

Let me know if anything isn’t clear and I'll be happy to add more details.

Thanks!

Member

Let me give a typical scenario. For an existing table, if we drop the column col1, the data will not be visible to the end users. This is expected. However, if we add col1 back, the previous values of the column col1 should not be visible to the end users either. This is how the other database systems work. What are your thoughts?

Author

I understand your point, it makes total sense, and I would completely agree if Hive were like the other DBMSs. According to your scenario, the below should also not be possible, but it is:

>>> spark.sql("CREATE EXTERNAL TABLE test_table (c1 string, c2 int ) STORED AS PARQUET LOCATION '/tmp/test_table'")
DataFrame[]

>>> spark.sql("INSERT INTO test_table VALUES ('c1', 2)")
DataFrame[]

>>> spark.sql("SELECT * FROM test_table").show()
+---+---+
| c1| c2|
+---+---+
| c1|  2|
+---+---+

>>> spark.sql("DROP TABLE test_table")
DataFrame[]

>>> spark.catalog.listTables()
[]

>>> spark.sql("CREATE EXTERNAL TABLE test_table (c1 string, c2 int, c3 boolean) STORED AS PARQUET LOCATION '/tmp/test_table'")
DataFrame[]

>>> spark.sql("SELECT * FROM test_table").show()
+---+---+----+
| c1| c2|  c3|
+---+---+----+
| c1|  2|null|
+---+---+----+

Do you agree it's the same behaviour you explained in the "drop / re-create column" example?

Author

@gatorsmile any thoughts about my previous comment?

I was wondering if this could just be a miscommunication issue, as you are referring to the behaviour of alter table replace columns with "normal tables", while I'm mainly talking about "external tables"?

Maybe replace columns could be enabled only for external tables? Would that be acceptable / less risky in your opinion? A rough sketch of such a check follows.
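
For reference, a sketch of how such a restriction could be checked from the PySpark side (purely illustrative; the table name and the idea of gating on the table type are assumptions, not something implemented in this PR):

# Catalog.listTables() exposes a tableType field ("EXTERNAL" / "MANAGED"):
tables = {t.name: t for t in spark.catalog.listTables("default")}
if tables["test_table"].tableType != "EXTERNAL":
    raise ValueError("replace columns would only be allowed on external tables in this proposal")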

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106172 has finished for PR 24780 at commit da16d3e.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

manu-olx added 2 commits June 5, 2019 12:13
- add method in `SchemaUtils` to compare two schemas and raise an exception if columns with the same name have different data types
- enable this check in `verifyColumnsToAddReplace` to prevent `alter table replace columns` from replacing an existing column with a column having a different data type
- replacing an existing column with a different data type is not allowed
- added comments for easier understanding
- fixed buggy test "alter datasource table replace columns - partitioned" that was using the test function for non-partitioned tables
@emanuelebardelli emanuelebardelli force-pushed the SPARK-26388_SQL_implement_hive_ddl_alter_table_replace_columns branch from da16d3e to 5915c88 Compare June 5, 2019 11:08
@emanuelebardelli emanuelebardelli changed the title [SPARK-26388][SQL][WIP]_implement_hive_ddl_alter_table_replace_columns [SPARK-26388][SQL]_implement_hive_ddl_alter_table_replace_columns Jun 5, 2019
@emanuelebardelli
Author

emanuelebardelli commented Jun 5, 2019

@gengliangwang I've ported your changes from #24776 into this PR (0c3bc55)

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106201 has finished for PR 24780 at commit 5915c88.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- `org.apache.spark.sql.hive.HiveExternalSessionCatalogSuite.alter table drop columns` was expecting an exception that's not being raised anymore (b04f94a)
- renamed `DDLSuite.scala` tests to avoid name conflicts with `org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.alter datasource table add columns`
@SparkQA

SparkQA commented Jun 5, 2019

Test build #106202 has finished for PR 24780 at commit 30641eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@emanuelebardelli emanuelebardelli changed the title [SPARK-26388][SQL]_implement_hive_ddl_alter_table_replace_columns [SPARK-26388][SQL][WIP]_implement_hive_ddl_alter_table_replace_columns Jun 5, 2019
@thesuperzapper

@emanuelebardelli

If we are trying to replicate Hive's REPLACE COLUMNS command, we need to allow overwriting the entire schema of the table, regardless of whether the final table will be readable.

(The only exception is for table formats for which it doesn't make sense, like ones that determine columns by index rather than by name.)

I am not sure if your current approach will achieve this; can you confirm that:

  1. You can fully replace columns (including comments)
  2. You can reorder columns
  3. You can change the type of columns
    • There are cases where a new schema is able to read both old and new data without issues. What is allowed depends on the format, e.g. PARQUET/AVRO. (This is often used for schema evolution.)
    • Example: STRUCT<field1: STRING> → STRUCT<field1: STRING, field2: STRING>
      • Note that this is a column of type STRUCT, not the full schema of a table (a small illustration follows this list)
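
A small PySpark illustration of the struct case in point 3 (the path and schemas are hypothetical; for a name-resolved format like Parquet, the new nested field is expected to read back as null for old data):

from pyspark.sql.types import StructType, StructField, StringType

old_schema = StructType([StructField("s", StructType([StructField("field1", StringType())]))])
spark.createDataFrame([(("a",),)], old_schema).write.mode("overwrite").parquet("/tmp/struct_evo")

new_schema = StructType([StructField("s", StructType([
    StructField("field1", StringType()),
    StructField("field2", StringType())]))])

# Old files lack field2; it is expected to come back as null when read by name:
spark.read.schema(new_schema).parquet("/tmp/struct_evo").show()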

Perhaps you could look into using one of the existing methods to overwrite the schema of the table; they are implemented in the various external catalogue classes, for example HiveExternalCatalog.scala, which calls HiveClient.scala.

@emanuelebardelli
Author

@thesuperzapper

Yes, the idea is to implement replace columns fully, but it has to be done in a "Spark-safe" way.
At the moment, even the core functionality required for this operation (i.e. dropping columns) is not allowed and is still being questioned (see the conversation above).

In relation to your use cases:

  1. Yes, you can fully replace a column.
    -- I have not tested with comments; once there's agreement that this PR is actually viable, I'll add them.

  2. Yes.

  3. Yes and no: by default it is possible to change the type of a column while replacing it; however, I have specifically disabled this option in an attempt to make the replace columns operation safer (8b6da23).
    -- I have not tested complex data types (i.e. STRUCT); once there's agreement that this PR is actually viable, I'll add them.

@emanuelebardelli
Author

Closing it for now, as much has changed and the refactoring would probably require more effort than implementing it again on top of the new project structure.
