
Conversation

@xuanyuanking
Member

@xuanyuanking xuanyuanking commented Nov 17, 2017

What changes were proposed in this pull request?

Support users changing the column dataType in Hive tables and datasource tables, and make sure the changed data type works with all data sources.

  • DDL support for ALTER TABLE CHANGE COLUMN
  • Support in parquet vectorized reader
  • Support in parquet row reader
  • Support in orc vectorized reader
  • Support in orc row reader
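As a sketch of the DDL this PR enables (the syntax follows Hive's ALTER TABLE ... CHANGE COLUMN; the table and column names here are purely illustrative):

```scala
// Hypothetical example (requires a SparkSession with this patch applied).
spark.sql("CREATE TABLE t (a INT, b STRING) USING parquet")
// Change column `a` from INT to BIGINT and attach a comment.
spark.sql("ALTER TABLE t CHANGE COLUMN a a BIGINT COMMENT 'widened from int'")
spark.table("t").printSchema()
```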

How was this patch tested?

Added test cases in DDLSuite.scala and SQLQueryTestSuite.scala.

@xuanyuanking xuanyuanking changed the title Supporting for changing column dataType [SPARK-22546][SQL] Supporting for changing column dataType Nov 17, 2017
@SparkQA

SparkQA commented Nov 17, 2017

Test build #83964 has finished for PR 19773 at commit 1bcd74f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I'd recommend getting rid of this var and rewriting the code as follows:

val newField = newColumn.getComment.map(...).getOrElse(field)
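The var-free pattern suggested here can be sketched with plain Option (the `Field` and `NewColumn` types below are simplified stand-ins for Spark's StructField and the parsed column spec, not the real classes):

```scala
// Simplified stand-ins, illustrative only.
case class Field(name: String, comment: Option[String] = None) {
  def withComment(c: String): Field = copy(comment = Some(c))
}
case class NewColumn(getComment: Option[String])

val field = Field("a")
val newColumn = NewColumn(Some("widened"))

// Instead of a var mutated inside a branch, derive the new field in one expression:
// if a new comment is present, attach it; otherwise keep the field unchanged.
val newField = newColumn.getComment.map(field.withComment).getOrElse(field)
```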

Member Author

It's cleaner to get rid of the var; please check the next patch. If we implement rename or other metadata-change features here, some code rework may still be needed.

Contributor

What do you think about renaming the val to typeChanged?

Contributor

s/change/changing + s/take/takes

@SparkQA

SparkQA commented Nov 20, 2017

Test build #84012 has finished for PR 19773 at commit b145102.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 20, 2017

Test build #84022 has finished for PR 19773 at commit 7b9fb1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

@jaceklaskowski Thanks for your review and comments. I rebased the branch and addressed all comments; this patch is now ready for the next review.

Member

What is Hive's behavior if users change the column type of the partition schema?

Member Author

HIVE-3672: Hive supports this with a dedicated command, ALTER TABLE <table_name> PARTITION COLUMN (<column_name> <new type>).
So maybe I should throw an AnalysisException here when the user changes the type of a partition column?

Member Author

I added the checking logic in the next commit and fixed a bug in changing the comment of a partition column.

@SparkQA

SparkQA commented Nov 24, 2017

Test build #84164 has finished for PR 19773 at commit 77626e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

gentle ping @gatorsmile

@maropu
Member

maropu commented Jul 23, 2018

@xuanyuanking Any update?

@xuanyuanking
Member Author

I'll resolve the conflicts today, thanks for pinging me.

@xuanyuanking
Member Author

@gatorsmile @maropu Please have a look at this; resolving the conflicts took me some time.
Also cc @jiangxb1987 because the conflicts are mainly with #20696. Thanks for the work in #20696; the latest PR no longer needs the extra work for changing partition column comments.

val partitionColumnChanged = table.partitionColumnNames.contains(originColumn.name)

// Throw an AnalysisException if the type of partition column is changed.
if (typeChanged && partitionColumnChanged) {
Member Author

Just adding a check here for when the user changes the type of a partition column.

@SparkQA

SparkQA commented Jul 24, 2018

Test build #93504 has finished for PR 19773 at commit d8982b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

Gentle ping @maropu, could you help review this? I'll keep following up on it.

val originColumn = findColumnByName(table.dataSchema, columnName, resolver)
// Throw an AnalysisException if the column name/dataType is changed.
if (!columnEqual(originColumn, newColumn, resolver)) {
Member

Is it ok to check names only?

Member Author

Thanks, that's not enough yet; added a type-compatibility check in ef65c4d.

throw new AnalysisException(
"ALTER TABLE CHANGE COLUMN is not supported for changing column " +
s"'${originColumn.name}' with type '${originColumn.dataType}' to " +
s"'${newColumn.name}' with type '${newColumn.dataType}'")
Member

Can you update this error message?

Member Author

After adding the type check, maybe we also need the type information in the error message.

// Ensure that changing a partition column's type throws an exception
intercept[AnalysisException] {
sql("ALTER TABLE dbx.tab1 CHANGE COLUMN a a STRING")
}
Member

Please compare the error message.

Member Author

Thanks, done in ef65c4d. Also added a type-compatibility check.

private def addComment(column: StructField, comment: Option[String]): StructField = {
comment.map(column.withComment(_)).getOrElse(column)
}

Member

What happens if we need data conversion (e.g., from int to double) in binary formats (Parquet and ORC)? Also, what happens if we get incompatible type changes?

Member Author

Thanks for the advice; I should also check type compatibility, added in ef65c4d.

Member

Probably we need to comply with the Hive behaviour. Is the current fix (by casting) the same as Hive's?

Member Author

Thanks for your question; that's actually also what I considered while doing the compatibility check. Hive does this column type change work in HiveAlterHandler, and the detailed compatibility check is in ColumnType. You can see that the ColumnType check actually uses canCast semantics to judge compatibility.
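A compatibility check that leans on Spark's own cast semantics could be sketched as below. Cast.canCast is a real catalyst helper, but whether the PR wires it in exactly this way is an assumption; the helper name is hypothetical:

```scala
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.types._

// Hypothetical helper: only allow a column type change when Spark can cast
// between the types, mirroring Hive's ColumnType canCast-style check.
def typeChangeAllowed(from: DataType, to: DataType): Boolean =
  Cast.canCast(from, to)

typeChangeAllowed(IntegerType, DoubleType)                       // int -> double is castable
typeChangeAllowed(MapType(StringType, IntegerType), IntegerType) // map -> int is not
```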

Member

Ah, ok. Thanks for the check. By the way, have you checked whether this works correctly?


sql("""CREATE TABLE t(a INT, b STRING, c INT) using parquet""")
sql("""INSERT INTO t VALUES (1, 'a', 3)""")
sql("""ALTER TABLE t CHANGE a a STRING""")
spark.table("t").show
org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file file:///Users/maropu/Repositories/spark/spark-master/spark-warehouse/t/part-00000-93ddfd05-690a-480c-8cc5-fd0981206fc3-c000.snappy.parquet. Column: [a], Expected: string, Found: INT32
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:192)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.s
    ...

Member

In my opinion, in this PR we need additional logic to cast input data to the changed type in the catalog when reading...
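The cast-on-read idea could, conceptually, look like reading with the file's original schema and projecting casts on top (purely illustrative; the path is hypothetical, and the real change would have to live inside the Parquet/ORC vectorized and row readers, not in a projection):

```scala
import org.apache.spark.sql.functions.col

// Sketch only: read with the schema the files were written with...
val fileSchemaDF = spark.read.schema("a INT, b STRING, c INT").parquet("/path/to/t")
// ...then cast to the catalog (changed) schema, here `a` changed to STRING.
val castToCatalog = fileSchemaDF.select(
  col("a").cast("string").as("a"),
  col("b"),
  col("c"))
```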

Member Author

Thanks for your advice!
I looked into this over the last few days. With the current implementation, all behavior complies with Hive (type changes are supported, non-binary formats work well, and binary formats like ORC and Parquet throw exceptions). Is it ok to add a config to constrain this?

The work of adding logic to cast input data to the changed type in the catalog may require modifying four pieces of logic: the vectorized readers and row readers for both Parquet and ORC. If we don't agree with the current behavior, I'll keep following up on these.

| Item | Behavior |
| --- | --- |
| Parquet Row Reader | ClassCastException in SpecificInternalRow.set${Type} |
| Parquet Vectorized Reader | SchemaColumnConvertNotSupportedException in VectorizedColumnReader.read${Type}Batch |
| Orc Row Reader | ClassCastException in OrcDeserializer.newWriter |
| Orc Vectorized Reader | NullPointerException in OrcColumnVector get-value-by-type methods |

Member

Thanks for the check! I think we don't always need to comply with the Hive behaviour; an understandable behaviour for users is best.

Member

Thank you for pinging me, @maropu .

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95783 has finished for PR 19773 at commit ef65c4d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95795 has finished for PR 19773 at commit ef65c4d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95798 has finished for PR 19773 at commit ef65c4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @xuanyuanking

Thank you for the contribution. This is meaningful work for Apache Spark 2.5. I think we need some more improvements, because the following is not true in Apache Spark:

With the current implementation, all behavior complies with Hive (type changes are supported, non-binary formats work well, and binary formats like ORC and Parquet throw exceptions). Is it ok to add a config to constrain this?

Apache Spark already supports changing column types as a part of schema evolution. In particular, the ORC vectorized reader supports upcasting, although it's not the same as canCast.

For the detailed Spark support coverage, see SPARK-23007. It covered all built-in data sources at that time.

Please note that every data source has different capabilities. So this PR needs to prevent ALTER TABLE CHANGE COLUMN for those data sources case by case. And we need corresponding test cases.
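The kind of schema evolution Spark already handles can be exercised roughly like this (a sketch only; the output path is hypothetical, and exact upcast coverage is per-format, as tracked in SPARK-23007):

```scala
// Sketch: write ORC data as INT, then read it back with a widened LONG schema.
// The ORC vectorized reader supports this kind of upcast; incompatible changes
// (e.g. INT -> STRING for Parquet, as shown earlier in this thread) fail.
spark.range(3).selectExpr("CAST(id AS INT) AS a").write.orc("/tmp/orc_upcast")
val widened = spark.read.schema("a LONG").orc("/tmp/orc_upcast")
widened.printSchema()
```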

@xuanyuanking
Member Author

@maropu @dongjoon-hyun Many thanks for your guidance!

Apache Spark already supports changing column types as a part of schema evolution. In particular, the ORC vectorized reader supports upcasting, although it's not the same as canCast.

For the detailed Spark support coverage, see SPARK-23007. It covered all built-in data sources at that time.

Many thanks, I'll study this background soon.

Please note that every data sources have different capability. So, this PR needs to prevent ALTER TABLE CHANGE COLUMN for those data sources case-by-case. And, we need corresponding test cases.

Got it, I'll keep following the cases in this PR. I've roughly split them into 4 tasks and updated the PR description first. I'll pay attention to the corresponding test cases in each task.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 14, 2020
@github-actions github-actions bot closed this Jan 15, 2020
