
Conversation

@xuanyuanking
Member

@xuanyuanking xuanyuanking commented Nov 17, 2017

What changes were proposed in this pull request?

Support users changing the column dataType in Hive tables and datasource tables, and make sure the changed data type works with all data sources.

  • DDL support for ALTER TABLE CHANGE COLUMN
  • Support in parquet vectorized reader
  • Support in parquet row reader
  • Support in orc vectorized reader
  • Support in orc row reader
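As a sketch of the DDL this PR enables (the syntax follows Hive's ALTER TABLE ... CHANGE COLUMN; the table and column names here are purely illustrative):

```scala
// Hypothetical example (requires a SparkSession with this patch applied).
spark.sql("CREATE TABLE t (a INT, b STRING) USING parquet")
// Change column `a` from INT to BIGINT and attach a comment.
spark.sql("ALTER TABLE t CHANGE COLUMN a a BIGINT COMMENT 'widened from int'")
spark.table("t").printSchema()
```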

How was this patch tested?

Added test cases in DDLSuite.scala and SQLQueryTestSuite.scala.

@xuanyuanking xuanyuanking changed the title Supporting for changing column dataType [SPARK-22546][SQL] Supporting for changing column dataType Nov 17, 2017
@SparkQA

SparkQA commented Nov 17, 2017

Test build #83964 has finished for PR 19773 at commit 1bcd74f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I'd recommend getting rid of this var and rewriting the code as follows:

val newField = newColumn.getComment.map(...).getOrElse(field)
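The var-free pattern suggested here can be sketched with plain Option (the `Field` and `NewColumn` types below are simplified stand-ins for Spark's StructField and the parsed column spec, not the real classes):

```scala
// Simplified stand-ins, illustrative only.
case class Field(name: String, comment: Option[String] = None) {
  def withComment(c: String): Field = copy(comment = Some(c))
}
case class NewColumn(getComment: Option[String])

val field = Field("a")
val newColumn = NewColumn(Some("widened"))

// Instead of a var mutated inside a branch, derive the new field in one expression:
// if a new comment is present, attach it; otherwise keep the field unchanged.
val newField = newColumn.getComment.map(field.withComment).getOrElse(field)
```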

Member Author

It's cleaner to get rid of the var; please check the next patch. If we implement rename or other metadata-change features here, some code rework may still be needed.

Contributor

What do you think about renaming the val to typeChanged?

Contributor

s/change/changing + s/take/takes

@SparkQA

SparkQA commented Nov 20, 2017

Test build #84012 has finished for PR 19773 at commit b145102.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 20, 2017

Test build #84022 has finished for PR 19773 at commit 7b9fb1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

@jaceklaskowski Thanks for your review and comments. I rebased the branch and addressed all comments; this patch is now ready for the next review.

Member

What is Hive's behavior if users change the column type of the partition schema?

Member Author

HIVE-3672: Hive supports this with a dedicated command, ALTER TABLE <table_name> PARTITION COLUMN (<column_name> <new type>).
So maybe I should throw an AnalysisException here when the user changes the type of a partition column?

Member Author

I added the checking logic in the next commit and fixed a bug in changing the comment of a partition column.

@SparkQA

SparkQA commented Nov 24, 2017

Test build #84164 has finished for PR 19773 at commit 77626e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

gentle ping @gatorsmile

@maropu
Member

maropu commented Jul 23, 2018

@xuanyuanking Any update?

@xuanyuanking
Member Author

I'll resolve the conflicts today, thanks for pinging me.

@xuanyuanking
Member Author

@gatorsmile @maropu Please have a look at this; resolving the conflicts took me some time.
Also cc @jiangxb1987 because the conflicts are mainly with #20696. Thanks for the work in #20696; the latest PR no longer needs the extra work for changing partition column comments.

val partitionColumnChanged = table.partitionColumnNames.contains(originColumn.name)

// Throw an AnalysisException if the type of partition column is changed.
if (typeChanged && partitionColumnChanged) {
Member Author

Just adding a check here for when the user changes the type of a partition column.

@SparkQA

SparkQA commented Jul 24, 2018

Test build #93504 has finished for PR 19773 at commit d8982b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

Gentle ping @maropu, could you help review this? I'll keep following up on it.

val originColumn = findColumnByName(table.dataSchema, columnName, resolver)
// Throw an AnalysisException if the column name/dataType is changed.
if (!columnEqual(originColumn, newColumn, resolver)) {
Member

Is it ok to check names only?

Member Author

Thanks, that's not enough yet; added a type-compatibility check in ef65c4d.

throw new AnalysisException(
"ALTER TABLE CHANGE COLUMN is not supported for changing column " +
s"'${originColumn.name}' with type '${originColumn.dataType}' to " +
s"'${newColumn.name}' with type '${newColumn.dataType}'")
Member

Can you update this error message?

Member Author

After adding the type check, maybe we also need the type information in the error message.

// Ensure that changing a partition column's type throws an exception
intercept[AnalysisException] {
sql("ALTER TABLE dbx.tab1 CHANGE COLUMN a a STRING")
}
Member

Please compare the error message.

Member Author

Thanks, done in ef65c4d. Also added a type-compatibility check.

private def addComment(column: StructField, comment: Option[String]): StructField = {
comment.map(column.withComment(_)).getOrElse(column)
}

Member

What happens if we need data conversion (e.g., from int to double) in binary formats (Parquet and ORC)? Also, what happens if we get incompatible type changes?

Member Author

Thanks for the advice; I should also check type compatibility, added in ef65c4d.

Member

Probably we need to comply with the Hive behaviour. Is the current fix (by casting) the same as Hive's?

Member Author

Thanks for your question; that's actually also what I considered while doing the compatibility check. Hive does this column type change work in HiveAlterHandler, and the detailed compatibility check is in ColumnType. You can see that the ColumnType check actually uses canCast semantics to judge compatibility.
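A compatibility check that leans on Spark's own cast semantics could be sketched as below. Cast.canCast is a real catalyst helper, but whether the PR wires it in exactly this way is an assumption; the helper name is hypothetical:

```scala
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.types._

// Hypothetical helper: only allow a column type change when Spark can cast
// between the types, mirroring Hive's ColumnType canCast-style check.
def typeChangeAllowed(from: DataType, to: DataType): Boolean =
  Cast.canCast(from, to)

typeChangeAllowed(IntegerType, DoubleType)                       // int -> double is castable
typeChangeAllowed(MapType(StringType, IntegerType), IntegerType) // map -> int is not
```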

Member

Ah, ok. Thanks for the check. By the way, have you checked whether this works correctly?


sql("""CREATE TABLE t(a INT, b STRING, c INT) using parquet""")
sql("""INSERT INTO t VALUES (1, 'a', 3)""")
sql("""ALTER TABLE t CHANGE a a STRING""")
spark.table("t").show
org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file file:///Users/maropu/Repositories/spark/spark-master/spark-warehouse/t/part-00000-93ddfd05-690a-480c-8cc5-fd0981206fc3-c000.snappy.parquet. Column: [a], Expected: string, Found: INT32
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:192)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.s
    ...

Member

In my opinion, in this PR we need additional logic to cast input data to the changed type in the catalog when reading...
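The cast-on-read idea could, conceptually, look like reading with the file's original schema and projecting casts on top (purely illustrative; the path is hypothetical, and the real change would have to live inside the Parquet/ORC vectorized and row readers, not in a projection):

```scala
import org.apache.spark.sql.functions.col

// Sketch only: read with the schema the files were written with...
val fileSchemaDF = spark.read.schema("a INT, b STRING, c INT").parquet("/path/to/t")
// ...then cast to the catalog (changed) schema, here `a` changed to STRING.
val castToCatalog = fileSchemaDF.select(
  col("a").cast("string").as("a"),
  col("b"),
  col("c"))
```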

Member Author

Thanks for your advice!
I looked into this over the last few days. With the current implementation, all behavior complies with Hive (type changes are supported, non-binary formats work well, and binary formats like ORC and Parquet throw exceptions). Is it ok to add a config to constrain this?

The work of adding logic to cast input data to the changed type in the catalog may require modifying four pieces of logic: the vectorized readers and row readers for both Parquet and ORC. If we don't agree with the current behavior, I'll keep following up on these.

| Item | Behavior |
| --- | --- |
| Parquet Row Reader | ClassCastException in SpecificInternalRow.set${Type} |
| Parquet Vectorized Reader | SchemaColumnConvertNotSupportedException in VectorizedColumnReader.read${Type}Batch |
| Orc Row Reader | ClassCastException in OrcDeserializer.newWriter |
| Orc Vectorized Reader | NullPointerException in OrcColumnVector get-value-by-type methods |

Member

Thanks for the check! I think we don't always need to comply with the Hive behaviour; an understandable behaviour for users is best.

Member

Thank you for pinging me, @maropu .

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95783 has finished for PR 19773 at commit ef65c4d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95795 has finished for PR 19773 at commit ef65c4d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95798 has finished for PR 19773 at commit ef65c4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @xuanyuanking

Thank you for the contribution. This is meaningful work for Apache Spark 2.5. I think we need some more improvements, because the following is not true in Apache Spark:

With the current implementation, all behavior complies with Hive (type changes are supported, non-binary formats work well, and binary formats like ORC and Parquet throw exceptions). Is it ok to add a config to constrain this?

Apache Spark already supports changing column types as a part of schema evolution. In particular, the ORC vectorized reader supports upcasting, although it's not the same as canCast.

For the detailed Spark support coverage, see SPARK-23007. It covered all built-in data sources at that time.

Please note that every data source has different capabilities. So this PR needs to prevent ALTER TABLE CHANGE COLUMN for those data sources case by case. And we need corresponding test cases.
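The kind of schema evolution Spark already handles can be exercised roughly like this (a sketch only; the output path is hypothetical, and exact upcast coverage is per-format, as tracked in SPARK-23007):

```scala
// Sketch: write ORC data as INT, then read it back with a widened LONG schema.
// The ORC vectorized reader supports this kind of upcast; incompatible changes
// (e.g. INT -> STRING for Parquet, as shown earlier in this thread) fail.
spark.range(3).selectExpr("CAST(id AS INT) AS a").write.orc("/tmp/orc_upcast")
val widened = spark.read.schema("a LONG").orc("/tmp/orc_upcast")
widened.printSchema()
```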

@xuanyuanking
Member Author

@maropu @dongjoon-hyun Many thanks for your guidance!

Apache Spark already supports changing column types as a part of schema evolution. In particular, the ORC vectorized reader supports upcasting, although it's not the same as canCast.

For the detailed Spark support coverage, see SPARK-23007. It covered all built-in data sources at that time.

Many thanks, I'll study this background soon.

Please note that every data sources have different capability. So, this PR needs to prevent ALTER TABLE CHANGE COLUMN for those data sources case-by-case. And, we need corresponding test cases.

Got it, I'll keep following the cases in this PR. I've roughly split them into 4 tasks and updated the PR description first. I'll pay attention to the corresponding test cases in each task.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 14, 2020
@github-actions github-actions bot closed this Jan 15, 2020
