[SPARK-22546][SQL] Supporting for changing column dataType #19773
Test build #83964 has finished for PR 19773 at commit
I'd recommend getting rid of this var and rewriting the code as follows:
val newField = newColumn.getComment.map(...).getOrElse(field)
Much clearer without the var; please check the next patch. If we implement rename or other metadata-change features here, the code may still need some rework.
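The var-free pattern suggested above can be sketched in a self-contained way. `Field` below is a simplified stand-in for Spark's `StructField`, used only for illustration:

```scala
// Simplified stand-in for Spark's StructField (an assumption for this sketch).
case class Field(name: String, comment: Option[String] = None) {
  def withComment(c: String): Field = copy(comment = Some(c))
}

object CommentExample {
  // Instead of:
  //   var newField = field
  //   if (newComment.isDefined) { newField = field.withComment(newComment.get) }
  // the Option combinators express the same logic without a var:
  def addComment(field: Field, newComment: Option[String]): Field =
    newComment.map(field.withComment).getOrElse(field)
}
```

`map` applies `withComment` only when a new comment is present, and `getOrElse` falls back to the unchanged field otherwise.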
What do you think about renaming the val to typeChanged?
s/change/changing + s/take/takes
Branch force-pushed from 1bcd74f to b145102.
Test build #84012 has finished for PR 19773 at commit
Test build #84022 has finished for PR 19773 at commit
@jaceklaskowski Thanks for your review and comments. I rebased the branch and addressed all comments; this patch is now ready for the next round of review.
What is Hive's behavior if users change the column type of the partition schema?
HIVE-3672: Hive supports this through a dedicated command, ALTER TABLE <table_name> PARTITION COLUMN (<column_name> <new type>).
So maybe I should throw an AnalysisException here when the user changes the type of a partition column?
I added the checking logic in the next commit and fixed a bug in changing the comment of a partition column.
Test build #84164 has finished for PR 19773 at commit
Gentle ping @gatorsmile
@xuanyuanking Any update?

I'll resolve the conflicts today; thanks for pinging me.
Branch force-pushed from 77626e9 to d8982b1.
@gatorsmile @maropu Please have a look at this; resolving the conflicts took me some time.
val partitionColumnChanged = table.partitionColumnNames.contains(originColumn.name)

// Throw an AnalysisException if the type of partition column is changed.
if (typeChanged && partitionColumnChanged) {
Just adding a check here for when the user changes the type of partition columns.
Test build #93504 has finished for PR 19773 at commit
Gentle ping @maropu, could you help review this? I'll keep following up on it.
val originColumn = findColumnByName(table.dataSchema, columnName, resolver)
- // Throw an AnalysisException if the column name/dataType is changed.
+ // Throw an AnalysisException if the column name is changed.
if (!columnEqual(originColumn, newColumn, resolver)) {
Is it ok to check names only?
Thanks, that's not enough yet; I added a type-compatibility check in ef65c4d.
throw new AnalysisException(
  "ALTER TABLE CHANGE COLUMN is not supported for changing column " +
  s"'${originColumn.name}' with type '${originColumn.dataType}' to " +
  s"'${newColumn.name}' with type '${newColumn.dataType}'")
Can you update this error message?
After adding the type check, we may also need the type information in the error message.
// Ensure that changing the partition column type throws an exception
intercept[AnalysisException] {
  sql("ALTER TABLE dbx.tab1 CHANGE COLUMN a a STRING")
}
Please compare the error message.
Thanks, done in ef65c4d. Also added a check for type compatibility.
private def addComment(column: StructField, comment: Option[String]): StructField = {
  comment.map(column.withComment(_)).getOrElse(column)
}
What happens if we need data conversion (e.g., from int to double) in binary formats (Parquet and ORC)? Also, what happens if we get incompatible type changes?
Thanks for the advice; I should also check type compatibility. Added in ef65c4d.
Probably, we need to comply with the Hive behaviour. Is the current fix (by casting) the same as Hive's?
Thanks for your question; that's also what I considered while doing the compatibility check. Hive does this column-type-change work in HiveAlterHandler, and the detailed compatibility check is in ColumnType. As you can see in the ColumnType checking logic, it actually uses canCast semantics to judge compatibility.
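To make the canCast idea concrete, here is a hedged, self-contained sketch of a cast-based compatibility check. The type names and the widening table are simplified illustrations, not Hive's actual ColumnType rules:

```scala
object TypeCompat {
  // Widening conversions considered safe in this simplified model
  // (illustrative only; Hive's real table lives in ColumnType).
  private val widens: Map[String, Set[String]] = Map(
    "int"    -> Set("bigint", "float", "double", "string"),
    "bigint" -> Set("float", "double", "string"),
    "float"  -> Set("double", "string"),
    "double" -> Set("string")
  )

  // A change is compatible if the types match or the source widens to the target.
  def canCast(from: String, to: String): Boolean =
    from == to || widens.getOrElse(from, Set.empty).contains(to)
}
```

Under this model, `int -> double` is accepted while `string -> int` is rejected, which mirrors the cast-based judgement described above.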
Ah, ok. Thanks for the check. btw, have you checked if this could work correctly?
sql("""CREATE TABLE t(a INT, b STRING, c INT) using parquet""")
sql("""INSERT INTO t VALUES (1, 'a', 3)""")
sql("""ALTER TABLE t CHANGE a a STRING""")
spark.table("t").show
org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file file:///Users/maropu/Repositories/spark/spark-master/spark-warehouse/t/part-00000-93ddfd05-690a-480c-8cc5-fd0981206fc3-c000.snappy.parquet. Column: [a], Expected: string, Found: INT32
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:192)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.s
...
In my opinion, in this PR we need additional logic to cast input data to the changed type in the catalog when reading...
Thanks for your advice!
I looked into this over the last few days. With the current implementation, all behavior complies with Hive (type changes are supported, non-binary formats work well, and binary formats like ORC and Parquet throw exceptions). Is it ok to add a config to constrain this?
Adding logic to cast input data to the changed type in the catalog may require modifying four pieces of logic, including the vectorized and row readers for both Parquet and ORC. If we don't agree with the current behavior, I'll keep following up on these:
| Item | Behavior |
|---|---|
| Parquet Row Reader | ClassCastException in SpecificInternalRow.set${Type} |
| Parquet Vectorized Reader | SchemaColumnConvertNotSupportedException in VectorizedColumnReader.read${Type}Batch |
| Orc Row Reader | ClassCastException in OrcDeserializer.newWriter |
| Orc Vectorized Reader | NullPointerException in OrcColumnVector get value by type method |
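The "cast input data to the changed type when reading" idea could be sketched roughly as follows; the type encoding and cast rules here are illustrative assumptions, not Spark's reader internals:

```scala
object CastOnRead {
  sealed trait DType
  case object IntT    extends DType
  case object StringT extends DType

  // Project a stored value from the file schema's type to the catalog's
  // (possibly changed) type at read time.
  def castValue(v: Any, from: DType, to: DType): Any = (from, to) match {
    case (f, t) if f == t => v
    case (IntT, StringT)  => v.toString
    case _ =>
      throw new IllegalArgumentException(s"unsupported read-time cast: $from -> $to")
  }

  // Apply the per-column cast across a whole row.
  def readRow(row: Seq[Any], fileTypes: Seq[DType], catalogTypes: Seq[DType]): Seq[Any] =
    row.zip(fileTypes.zip(catalogTypes)).map { case (v, (f, t)) => castValue(v, f, t) }
}
```

In the real readers this projection would have to be wired into both the row-based and vectorized paths, which is why the table above lists four separate failure sites.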
Thanks for the check! I think we don't always need to comply with the Hive behaviour; an understandable behaviour for users is best.
Thank you for pinging me, @maropu .
Test build #95783 has finished for PR 19773 at commit

retest this please.

Test build #95795 has finished for PR 19773 at commit

retest this please.

Test build #95798 has finished for PR 19773 at commit
dongjoon-hyun left a comment
Hi, @xuanyuanking
Thank you for the contribution. This is meaningful work for Apache Spark 2.5. I think we need some more improvements, because the following is not true in Apache Spark:

> With the current implementation, all behavior complies with Hive (type changes are supported, non-binary formats work well, and binary formats like ORC and Parquet throw exceptions). Is it ok to add a config to constrain this?

Apache Spark already supports changing column types as a part of schema evolution. In particular, the ORC vectorized reader supports upcasting, although it is not the same as canCast.
For the details of Spark's support coverage, see SPARK-23007. It covered all built-in data sources at that time.
Please note that every data source has different capabilities, so this PR needs to prevent ALTER TABLE CHANGE COLUMN for those data sources case by case. And we need corresponding test cases.
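The per-data-source gating the review asks for might look roughly like this; the format names and the allowed-change table are assumptions for illustration, not Spark's actual capability model:

```scala
object AlterCheck {
  // (format, fromType, toType) triples treated as readable after the change
  // in this sketch (illustrative; real capabilities differ per reader).
  private val supported: Set[(String, String, String)] = Set(
    ("orc", "int", "bigint"),  // e.g. an upcast the ORC reader can handle
    ("csv", "int", "string")   // text formats re-parse values on read
  )

  // Reject ALTER TABLE CHANGE COLUMN unless the format can read the new type.
  def checkAlter(format: String, from: String, to: String): Unit =
    if (from != to && !supported.contains((format, from, to)))
      throw new IllegalArgumentException(
        s"ALTER TABLE CHANGE COLUMN: format '$format' cannot change $from to $to")
}
```

Keeping the rules in a per-format table makes it straightforward to add a test case for each (format, from, to) combination, which is what the review asks for.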
@maropu @dongjoon-hyun Many thanks for your guidance! I'll study this background soon. Got it: I'll keep following the cases in this PR. I've roughly split them into four tasks and updated the description of this PR first. I'll pay attention to the corresponding test cases in each task.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?

Support changing a column's dataType via ALTER TABLE CHANGE COLUMN for both Hive tables and datasource tables, and make sure the changed data type works with all data sources.

How was this patch tested?

Added test cases in DDLSuite.scala and SQLQueryTestSuite.scala.