Conversation

@emanuelebardelli emanuelebardelli commented Jun 3, 2019

What changes were proposed in this pull request?

Add support for ALTER TABLE REPLACE COLUMNS

  • drop a column if it is not present in the new schema
  • add a column if it is not present in the table’s schema
  • keep a column if column.name and column.dataType match between the table and new schemas
  • raise an exception if trying to replace an existing column with a different data type (a usage sketch follows this list)
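
A quick sketch of the intended behaviour (table and column names are illustrative only, not taken from the test suite):

spark.sql("CREATE TABLE t (c1 STRING, c2 INT, c3 BOOLEAN) USING PARQUET")

# The new column list keeps c1 and c2 (same name and type), drops c3, and adds c4:
spark.sql("ALTER TABLE t REPLACE COLUMNS (c1 STRING, c2 INT, c4 STRING)")

# Replacing c2 with a different data type would raise an exception instead:
# spark.sql("ALTER TABLE t REPLACE COLUMNS (c1 STRING, c2 STRING)")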

This patch is inspired by and mostly based on the work done by @xwu0226 in this PR: #16626

PR highlights:

  • dropping columns is now allowed (b04f94a)
  • AlterTableAddColumnsCommand refactored to be more generic and support the implementation of AlterTableReplaceColumnsCommand (f8cbb96, 00f6376, da16d3e)
  • refactored how the supported datasource formats for the add and replace columns commands are validated (da16d3e)
  • implemented a SchemaUtils check to compare two schemas and raise an exception if columns with the same name have different data types (8b6da23). This is based on the work done by @maropu here: f3dea60, 647963a. (A rough sketch of this kind of check follows the list.)
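
A rough PySpark sketch of the kind of schema check described in the last bullet. This is illustrative only: it is not the Scala SchemaUtils helper added in the PR, and the function name and the case-insensitive matching are assumptions.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def check_replaced_column_types(table_schema, new_schema):
    # Raise if a column kept by name would change its data type (case-insensitive here).
    existing = {f.name.lower(): f.dataType for f in table_schema.fields}
    for f in new_schema.fields:
        old_type = existing.get(f.name.lower())
        if old_type is not None and old_type != f.dataType:
            raise ValueError("Cannot replace column '%s': %s != %s"
                             % (f.name, old_type.simpleString(), f.dataType.simpleString()))

# Replacing c2 INT with c2 STRING should be rejected:
old = StructType([StructField("c1", StringType()), StructField("c2", IntegerType())])
new = StructType([StructField("c1", StringType()), StructField("c2", StringType())])
# check_replaced_column_types(old, new)  # raises ValueError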

Known issues:

  • AlterTableReplaceColumnsCommand does not support CSV datasource format (13e3823)
  • If a column is dropped, then re-added later with a different data type, this would be allowed, but querying the table is then likely to fail, depending on the datasource format and the data types used (a small illustration follows this list).
    -- However, this is actually no different from dropping the whole table and re-creating it with a schema that doesn’t match the underlying data.
    -- In any case, if the underlying data doesn't match the table metadata, issues are likely. The Hive DDL documentation specifically warns about this possibility and leaves the responsibility for keeping data and metadata consistent to the user: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns
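
A hypothetical illustration of the second known issue (the exact error depends on the datasource format; Parquet is assumed here, and the table name is made up):

spark.sql("CREATE TABLE t2 (c1 STRING, c2 INT) USING PARQUET")
spark.sql("INSERT INTO t2 VALUES ('a', 1)")

# Drop c2, then re-add it with a different data type:
spark.sql("ALTER TABLE t2 REPLACE COLUMNS (c1 STRING)")
spark.sql("ALTER TABLE t2 REPLACE COLUMNS (c1 STRING, c2 STRING)")

# The files still store c2 as INT while the metadata now says STRING, so reading
# is likely to fail (e.g. a 'Parquet column cannot be converted' style error):
# spark.sql("SELECT * FROM t2").show()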

TODO:

  • agree on requirements and best approach
  • add tests for replacing comments
  • add tests for complex data types
  • add test for columns re-ordering

How was this patch tested?

  • added test cases
  • manually tested
  • ran the `testOnly org.apache.spark.sql.*` test suite; all tests pass locally

manu-olx added 10 commits June 4, 2019 15:01
- extract schema column requirement checks to method verifyColumnsToAddReplace
- class AlterTableAddColumnsCommand extends from base abstract class
- class AlterTableAddColumnsCommand uses verifyColumnsToAddReplace to check columns requirements
  - slight change in behaviour here: before, DDLUtils.checkDataColNames (i.e. checking the column name format) was done only on the columns to be added; now it's done on all columns, both the columns to be added and the columns already in the table
…s to abstract class

- refactor comments and exceptions to work for both add and replace operations
- remove 'replace columns' unsupported test
- fix 'add columns' unsupported operations tests to match refactored exceptions messages
- add orc as a supported format to the add columns test, as it is enabled in code
- refactor do not support views tests to work for both add and replace columns
- add tests for replace columns with duplicated columns
- add tests for replace columns with case sensitive sql conf
- alterTableDataSchema overwrites the table schema with the new schema provided, without checking whether the current schema columns are also available in the new schema
- align comments in other classes
- allow custom supported formats for add and replace commands, as add doesn't support text and replace doesn't support csv
- consolidate the way the checks are performed to the 'canonical name parsing' approach
- remove unused imports
@emanuelebardelli emanuelebardelli changed the title [SPARK-26388][SQL]_implement_hive_ddl_alter_table_replace_columns [SPARK-26388][SQL][WIP]_implement_hive_ddl_alter_table_replace_columns Jun 4, 2019
Member

I am fine with adding this syntax, but this only supports changing column comments, right?

Member

Dropping the column is still risky to us. When you add the columns back, we might see unexpected values.

Author

@gatorsmile the idea is actually to implement alter table replace columns fully:

  • drop a column if it is not present in the new schema
  • add a column if it is not present in the table’s schema
  • keep a column if the column name and column type match between schemas
  • raise an exception if trying to replace an existing column with a different data type

I believe this is a much needed feature for being able to manage metastores fully from Spark; however, I understand the complexity and risks of this operation, so please let’s take the time to make sure it’s the right thing to do and to consider all possible ramifications.

I tried to make the commit history as clean and as descriptive as possible, and I've added some more details in the known issues section of the PR description.

I've also added an extra validation step to prevent columns from being replaced if the data type doesn't match (8b6da23).

Let me know if anything isn’t clear and I'll be happy to add more details.

Thanks!

Member

Let me give a typical scenario. For an existing table, if we drop the column col1, the data will not be visible to the end users. This is expected. However, if we add col1 back, the previous values of the column col1 should not be visible to the end users either. This is how the other database systems work. What are your thoughts?

Author

I understand your point, it makes total sense, and I would completely agree if Hive were like the other DBMSs. According to your scenario, the below should also not be possible, but it is:

>>> spark.sql("CREATE EXTERNAL TABLE test_table (c1 string, c2 int ) STORED AS PARQUET LOCATION '/tmp/test_table'")
DataFrame[]

>>> spark.sql("INSERT INTO test_table VALUES ('c1', 2)")
DataFrame[]

>>> spark.sql("SELECT * FROM test_table").show()
+---+---+
| c1| c2|
+---+---+
| c1|  2|
+---+---+

>>> spark.sql("DROP TABLE test_table")
DataFrame[]

>>> spark.catalog.listTables()
[]

>>> spark.sql("CREATE EXTERNAL TABLE test_table (c1 string, c2 int, c3 boolean) STORED AS PARQUET LOCATION '/tmp/test_table'")
DataFrame[]

>>> spark.sql("SELECT * FROM test_table").show()
+---+---+----+
| c1| c2|  c3|
+---+---+----+
| c1|  2|null|
+---+---+----+

Do you agree it's the same behaviour you explained in the "drop / re-create column" example?

Author

@gatorsmile any thoughts about my previous comment?

I was wondering if this could just be a miscommunication issue, as you are referring to the behaviour of alter table replace columns with "normal tables", while I'm mainly talking about "external tables"?

Maybe replace columns could be enabled only for external tables? Would that be acceptable / less risky in your opinion? A rough sketch of such a check follows.
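
For reference, a sketch of how such a restriction could be checked from the PySpark side (purely illustrative; the table name and the idea of gating on the table type are assumptions, not something implemented in this PR):

# Catalog.listTables() exposes a tableType field ("EXTERNAL" / "MANAGED"):
tables = {t.name: t for t in spark.catalog.listTables("default")}
if tables["test_table"].tableType != "EXTERNAL":
    raise ValueError("replace columns would only be allowed on external tables in this proposal")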

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106172 has finished for PR 24780 at commit da16d3e.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

manu-olx added 2 commits June 5, 2019 12:13
- add method in `SchemaUtils` to compare two schemas and raise an exception if columns with the same name have different data types
- enable this check in `verifyColumnsToAddReplace` to prevent `alter table replace columns` from replacing an existing column with a column having a different data type
- replacing an existing column with a different data type is not allowed
- added comments for easier understanding
- fixed buggy test "alter datasource table replace columns - partitioned" that was using the test function for non-partitioned tables
@emanuelebardelli emanuelebardelli force-pushed the SPARK-26388_SQL_implement_hive_ddl_alter_table_replace_columns branch from da16d3e to 5915c88 Compare June 5, 2019 11:08
@emanuelebardelli emanuelebardelli changed the title [SPARK-26388][SQL][WIP]_implement_hive_ddl_alter_table_replace_columns [SPARK-26388][SQL]_implement_hive_ddl_alter_table_replace_columns Jun 5, 2019
@emanuelebardelli
Author

emanuelebardelli commented Jun 5, 2019

@gengliangwang I've ported your changes from #24776 into this PR (0c3bc55)

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106201 has finished for PR 24780 at commit 5915c88.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- `org.apache.spark.sql.hive.HiveExternalSessionCatalogSuite.alter table drop columns` was expecting an exception that's not being raised anymore (b04f94a)
- renamed `DDLSuite.scala` tests to avoid name conflicts with `org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.alter datasource table add columns`
@SparkQA

SparkQA commented Jun 5, 2019

Test build #106202 has finished for PR 24780 at commit 30641eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@emanuelebardelli emanuelebardelli changed the title [SPARK-26388][SQL]_implement_hive_ddl_alter_table_replace_columns [SPARK-26388][SQL][WIP]_implement_hive_ddl_alter_table_replace_columns Jun 5, 2019
@thesuperzapper

@emanuelebardelli

If we are trying to replicate Hive's REPLACE COLUMNS command, we need to allow overwriting the entire schema of the table, regardless of whether the final table will be readable.

(The only exception is for table formats for which it doesn't make sense, like ones that determine columns by index rather than by name.)

I am not sure if your current approach will achieve this; can you confirm that:

  1. You can fully replace columns (including comments)
  2. You can reorder columns
  3. You can change the type of columns
    • There are cases where a new schema is able to read both old and new data without issues. What is allowed depends on the format, e.g. PARQUET/AVRO. (This is often used for schema evolution.)
    • Example: STRUCT<field1: STRING> → STRUCT<field1: STRING, field2: STRING>
      • Note that this is a column of type STRUCT, not the full schema of a table (a small illustration follows this list)
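
A small PySpark illustration of the struct case in point 3 (the path and schemas are hypothetical; for a name-resolved format like Parquet, the new nested field is expected to read back as null for old data):

from pyspark.sql.types import StructType, StructField, StringType

old_schema = StructType([StructField("s", StructType([StructField("field1", StringType())]))])
spark.createDataFrame([(("a",),)], old_schema).write.mode("overwrite").parquet("/tmp/struct_evo")

new_schema = StructType([StructField("s", StructType([
    StructField("field1", StringType()),
    StructField("field2", StringType())]))])

# Old files lack field2; it is expected to come back as null when read by name:
spark.read.schema(new_schema).parquet("/tmp/struct_evo").show()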

Perhaps you could look into using one of the existing methods to overwrite the schema of the table; they are implemented in the various external catalogue classes, for example HiveExternalCatalog.scala, which calls HiveClient.scala.

@emanuelebardelli
Author

@thesuperzapper

Yes, the idea is to implement replace columns fully, but it has to be done in a "Spark-safe" way.
At the moment, even the core functionality required for this operation (i.e. dropping columns) is not allowed and is still being questioned (see the conversation above).

In relation to your use cases:

  1. Yes, you can fully replace a column.
    -- I have not tested with comments; once there's agreement that this PR is actually viable, I'll add them.

  2. Yes.

  3. Yes and no: by default it is possible to change the type of a column while replacing it; however, I have specifically disabled this option in an attempt to make the replace columns operation safer (8b6da23).
    -- I have not tested complex data types (i.e. STRUCT); once there's agreement that this PR is actually viable, I'll add them.

@emanuelebardelli
Author

Closing it for now, as much has changed and the refactoring would probably require more effort than implementing it again on top of the new project structure.
