[SPARK-26388][SQL][WIP]_implement_hive_ddl_alter_table_replace_columns #24780
Conversation
- extract schema column requirement checks to method verifyColumnsToAddReplace - class AlterTableAddColumnsCommand extends from base abstract class - class AlterTableAddColumnsCommand uses verifyColumnsToAddReplace to check column requirements - slight change in behaviour here: before, DDLUtils.checkDataColNames (i.e. checking column name format) was done only on the columns to be added; now it's done on all columns, both the columns to be added and the columns already in the table
…s to abstract class - refactor comments and exceptions to work for both add and replace operations
- remove 'replace columns' unsupported test - fix 'add columns' unsupported-operations tests to match the refactored exception messages
- add orc as a supported format to the add columns test, as it is enabled in code - refactor the "do not support views" tests to work for both add and replace columns - add tests for replace columns with duplicated columns - add tests for replace columns with case-sensitive SQL conf
- alterTableDataSchema overwrites the table schema with the new schema provided, without checking whether the current schema columns are also available in the new schema - align comments in other classes
- allow custom supported formats for the add and replace commands, as add doesn't support text and replace doesn't support csv - consolidate the way the checks are performed to the 'canonical name parsing' approach - remove unused imports
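To make the checks described in the commit messages above concrete, here is a minimal Python sketch of the kind of duplicate-column verification verifyColumnsToAddReplace performs, including the case-sensitive SQL conf mentioned in the tests (the real implementation is Scala inside Spark; the name, signature, and error message below are illustrative only):

def verify_columns_to_add_replace(column_names, case_sensitive):
    # Reject duplicate column names across the columns to be added/replaced
    # and the columns already in the table; comparison honours a flag like
    # spark.sql.caseSensitive.
    seen = set()
    for name in column_names:
        key = name if case_sensitive else name.lower()
        if key in seen:
            raise ValueError("Found duplicate column(s): %s" % name)
        seen.add(key)

verify_columns_to_add_replace(["c1", "C1"], case_sensitive=True)   # ok
verify_columns_to_add_replace(["c1", "C1"], case_sensitive=False)  # raises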
I am fine with adding this syntax, but this only supports changing column comments, right?
Dropping the column is still risky to us. When you add the columns back, we might see unexpected values.
@gatorsmile the idea is actually to implement alter table replace columns fully (see the sketch after this list):
- drop a column if it is not present in the new schema
- add a column if it is not present in the table's schema
- keep a column if column_name and column_type match between schemas
- raise an exception if trying to replace an existing column with a different data type
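For illustration, a sketch of those semantics in a PySpark session (table and column names are hypothetical, not taken from the PR tests):

# Existing table schema: (a INT, b STRING, c DOUBLE)
spark.sql("CREATE TABLE t (a INT, b STRING, c DOUBLE) STORED AS PARQUET")

# 'b' is dropped (absent from the new schema), 'd' is added (absent from
# the table), 'a' and 'c' are kept because name and data type match.
spark.sql("ALTER TABLE t REPLACE COLUMNS (a INT, c DOUBLE, d BOOLEAN)")

# Raises: 'a' already exists as INT, so replacing it with STRING fails.
spark.sql("ALTER TABLE t REPLACE COLUMNS (a STRING, c DOUBLE, d BOOLEAN)")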
I believe this is a much needed feature to be able to manage meta-stores fully from Spark; however, I understand the complexity and risks of this operation, so please let's take the time to ensure it's the right thing to do and consider all possible ramifications.
I tried to make the commit history as clean and as descriptive as possible, and just added some more details in the PR description's known issues section.
I've also added an extra validation step to prevent columns from being replaced if the data type doesn't match (8b6da23).
Let me know if anything isn't clear and I'll be happy to add more details.
Thanks!
Let me give a typical scenario. For an existing table, if we drop the column col1, the data will not be visible to the end users. This is expected. However, if we add col1 back, the previous values of the column col1 should not be visible to the end users either. This is how other database systems work. What is your thought?
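For concreteness, that scenario reads roughly like this (hypothetical managed table; the REPLACE COLUMNS statements assume the syntax this PR adds):

spark.sql("CREATE TABLE t (col1 INT, col2 STRING)")
spark.sql("INSERT INTO t VALUES (1, 'x')")

# Drop col1 via replace, then add it back.
spark.sql("ALTER TABLE t REPLACE COLUMNS (col2 STRING)")
spark.sql("ALTER TABLE t REPLACE COLUMNS (col1 INT, col2 STRING)")

# In other database systems col1 would now read as NULL; with a
# metadata-only REPLACE COLUMNS over the existing files, the old
# value 1 can become visible again.
spark.sql("SELECT * FROM t").show()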
I understand your point; it makes total sense, and I would completely agree if Hive were like the other DBMSs. According to your scenario, the below should also not be possible, but it is:
>>> spark.sql("CREATE EXTERNAL TABLE test_table (c1 string, c2 int ) STORED AS PARQUET LOCATION '/tmp/test_table'")
DataFrame[]
>>> spark.sql("INSERT INTO test_table VALUES ('c1', 2)")
DataFrame[]
>>> spark.sql("SELECT * FROM test_table").show()
+---+---+
| c1| c2|
+---+---+
| c1| 2|
+---+---+
>>> spark.sql("DROP TABLE test_table")
DataFrame[]
>>> spark.catalog.listTables()
[]
>>> spark.sql("CREATE EXTERNAL TABLE test_table (c1 string, c2 int, c3 boolean) STORED AS PARQUET LOCATION '/tmp/test_table'")
DataFrame[]
>>> spark.sql("SELECT * FROM test_table").show()
+---+---+----+
| c1| c2| c3|
+---+---+----+
| c1| 2|null|
+---+---+----+
Do you agree it's the same behaviour you explained in the "drop / re-create column" example?
@gatorsmile any thoughts about my previous comment?
I was wondering if this could just be a miscommunication issue, as you are referring to the behaviour of alter table replace columns with "normal tables", while I'm mainly talking about "external tables".
Maybe replace columns could be enabled only for external tables? Would that be acceptable / less risky in your opinion?
ok to test
Test build #106172 has finished for PR 24780 at commit
- add method in `SchemaUtils` to compare two schemas and raise an exception if columns with the same name have different data types - enable this check in `verifyColumnsToAddReplace` to prevent `alter table replace columns` from replacing an existing column with a column having a different data type
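A minimal Python sketch of that schema comparison, assuming a hypothetical name (the actual method lives in Scala's `SchemaUtils`; case-sensitivity handling is omitted here):

from pyspark.sql.types import StructType

def check_column_type_compatibility(current: StructType, new: StructType) -> None:
    # Raise if a column appearing in both schemas has different data types,
    # mirroring the check described in the commit message above.
    new_types = {f.name: f.dataType for f in new.fields}
    for field in current.fields:
        new_type = new_types.get(field.name)
        if new_type is not None and new_type != field.dataType:
            raise ValueError(
                "Cannot replace column '%s': existing type %s does not match new type %s"
                % (field.name, field.dataType.simpleString(), new_type.simpleString()))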
- replacing an existing column with a different data type is not allowed - added comments for easier understanding - fixed buggy test "alter datasource table replace columns - partitioned" that was using the test function for non-partitioned tables
da16d3e to 5915c88
@gengliangwang I ported your changes from #24776 into this PR (0c3bc55)
Test build #106201 has finished for PR 24780 at commit
- `org.apache.spark.sql.hive.HiveExternalSessionCatalogSuite.alter table drop columns` was expecting an exception that's no longer raised (b04f94a) - renamed `DDLSuite.scala` tests to avoid name conflicts with `org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.alter datasource table add columns`
Test build #106202 has finished for PR 24780 at commit
If we are trying to replicate Hive's REPLACE COLUMNS command, we need to allow overwriting the entire schema of the table, regardless of whether the final table will be readable. (The only exception is table formats for which it doesn't make sense, like ones that determine columns by index rather than by name.) I am not sure whether your current approach will achieve this; can you confirm that:
Perhaps you could look into using one of the existing methods to overwrite the schema of the table; they are implemented in the various external catalogue classes, for example HiveExternalCatalog.scala, which calls HiveClient.scala.
yes, the idea is to implement it in relation to your use cases:
closing it for now, as much has changed and the refactoring would probably require more effort than implementing it again over the new project structure
What changes were proposed in this pull request?
Add support for `ALTER TABLE REPLACE COLUMNS`:
- a column is kept when `column.name` and `column.dataType` match between table and new schemas

This patch is inspired and mostly based on the work done by @xwu0226 in this PR: #16626
PR highlights:
- `AlterTableAddColumnsCommand` refactored to be more generic and to support the implementation of `AlterTableReplaceColumnsCommand` (f8cbb96, 00f6376, da16d3e)
- `SchemaUtils` check to compare two schemas and raise an exception if columns with the same name have different data types (8b6da23). This is based on the work done by @maropu here: f3dea60, 647963a

Known issues:
- `AlterTableReplaceColumnsCommand` does not support the CSV datasource format (13e3823)
-- However, this is actually not different from dropping the whole table and re-creating it with a schema that doesn't match the underlying data.
-- In any case, if the underlying data doesn't match the table metadata, issues are likely. The Hive DDL documentation specifically warns about this possibility and places the responsibility for ensuring consistency between data and metadata on the user: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns
TODO:
How was this patch tested?
`testOnly org.apache.spark.sql.*` test suite. All tests passing locally.