[SPARK-26176][SQL] Verify column names for CTAS with STORED AS #24075
sujith71955 wants to merge 4 commits into apache:master

Conversation
|
Test build #103390 has finished for PR 24075 at commit
|
## What changes were proposed in this pull request?
Currently, users hit job abortions while creating a table using the Hive serde "STORED AS" with invalid column names. We should prevent this by raising an **AnalysisException** with a guide to use aliases instead, as is already done for Parquet data source tables, thus making the error message consistent with the one shown while creating a Parquet/ORC native table.

**BEFORE**
```scala
scala> sql("CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1")
Caused by: java.lang.IllegalArgumentException: No enum constant parquet.schema.OriginalType.col1
	at java.lang.Enum.valueOf(Enum.java:238)
	at parquet.schema.OriginalType.valueOf(OriginalType.java:21)
	at parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:160)
	at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:111)
	at parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:99)
	at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:92)
	at parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:82)
	at
```

**AFTER**
```scala
CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
2019-03-13 03:17:58 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
	at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:58
```

## How was this patch tested?
Pass the Jenkins with a new test case.
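As context, the invalid-character rule behind the AFTER message can be sketched as follows. This is an illustrative Python sketch, not Spark's actual implementation (the real check is Scala code in `ParquetSchemaConverter`, as the stack trace shows); the character set is taken verbatim from the error message above.

```python
# Illustrative sketch of the column-name rule described by the error message.
# The character set " ,;{}()\n\t=" comes straight from the message in this PR.
INVALID_CHARS = set(" ,;{}()\n\t=")

def check_column_name(name: str) -> None:
    """Raise if the attribute name contains a character Parquet rejects."""
    if any(c in INVALID_CHARS for c in name):
        raise ValueError(
            f'Attribute name "{name}" contains invalid character(s) '
            'among " ,;{}()\\n\\t=". Please use alias to rename it.')

check_column_name("cnt")            # an aliased name like "cnt" passes
try:
    check_column_name("count(ID)")  # the unaliased aggregate name fails
except ValueError as exc:
    print(exc)
```

This is why the suggested fix for users is simply `SELECT COUNT(ID) AS cnt ...`: the alias replaces the parenthesized attribute name before it ever reaches the Parquet writer.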
|
retest this please |
|
Test build #103441 has finished for PR 24075 at commit
|
Seems to be a random failure |
|
retest this please |
|
Above failure is not related to this PR. Please review and let me know of any suggestions. Thanks |
|
Test build #103473 has started for PR 24075 at commit |
|
cc @gatorsmile |
|
|
```diff
  case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) =>
-   DDLUtils.checkDataColNames(tableDesc)
+   DDLUtils.checkDataColNames(tableDesc.copy(schema = query.schema))
```
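To see why the changed line matters: in a CTAS statement the table descriptor carries no explicit columns (they come from the query), so validating the descriptor alone checks nothing. A hedged sketch of this, with made-up names (`TableDesc` and `check_data_col_names` below are illustrative stand-ins, not Spark's actual classes):

```python
from dataclasses import dataclass, replace
from typing import List

INVALID = set(" ,;{}()\n\t=")

@dataclass(frozen=True)
class TableDesc:          # illustrative stand-in for Spark's CatalogTable
    name: str
    schema: List[str]     # column names only, for brevity

def check_data_col_names(desc: TableDesc) -> None:
    for col in desc.schema:
        if any(c in INVALID for c in col):
            raise ValueError(f"invalid column name: {col!r}")

ctas = TableDesc("TAB1TEST", schema=[])   # CTAS: no explicit column list
query_schema = ["count(ID)"]              # derived from the SELECT

check_data_col_names(ctas)                # passes vacuously: nothing to check
# Copying the query schema in, as the diff does, surfaces the bad name:
try:
    check_data_col_names(replace(ctas, schema=query_schema))
except ValueError as exc:
    print(exc)
```

The `tableDesc.copy(schema = query.schema)` in the diff plays the same role as the `replace(...)` call here: it makes the validator see the columns the query will actually produce.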
How is it done for data source tables? By another rule?
Yes, in the data source table scenario the flow goes through the DataSourceAnalysis rule.
Moreover, one more problem I observed: the serde class name defined in the checkDataColNames() API is "parquet.hive.serde.ParquetHiveSerDe", but for Parquet the serde name should be "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", so I added this serde as well in checkDataColNames().
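For illustration, the effect of recognizing both serde class names could look like the sketch below. `format_for_serde` is a hypothetical helper, not Spark's API; the two class names are the ones quoted in the comment above.

```python
# Both the short and the fully-qualified Parquet serde class names should map
# to the Parquet column-name rules; matching only one of them would let CTAS
# statements using the other slip past validation.
PARQUET_SERDES = {
    "parquet.hive.serde.ParquetHiveSerDe",
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
}

def format_for_serde(serde_class: str) -> str:
    """Hypothetical helper: pick the format whose naming rules apply."""
    return "parquet" if serde_class in PARQUET_SERDES else "other"

print(format_for_serde("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"))
# prints "parquet"
```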
Can we unify this check for both data source tables and Hive serde tables?
Sure, let me check. Thanks for your input.
Both are called from different rules; I will check how to unify them.
Handled as part of the PreprocessTableCreation rule for CTAS queries; please review and let me know of any suggestions. Thanks
|
Thank you for pinging me, @sujith71955 .
|
|
Retest this please. |
Sure. Thanks :) |
|
Test build #103524 has finished for PR 24075 at commit
|
|
retest this please |
|
Test build #103568 has finished for PR 24075 at commit
|
The failures seem unrelated to this PR |
|
retest this please |
|
Retest this please. |
|
Test build #103585 has finished for PR 24075 at commit
|
|
Gentle ping @dongjoon-hyun @cloud-fan |
```diff
  val analyzedQuery = query.get
  val normalizedTable = normalizeCatalogTable(analyzedQuery.schema, tableDesc)
+ DDLUtils.checkDataColNames(tableDesc.copy(schema = analyzedQuery.schema))
```
```diff
  if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
      isConvertible(tableDesc) && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_CTAS) =>
-   DDLUtils.checkDataColNames(tableDesc)
+   DDLUtils.checkDataColNames(tableDesc.copy(schema = query.schema))
```
Do we need to call it here?
|
Nope, I unified this for CTAS queries. For non-CTAS queries the flow is still the same, meaning the validation will be done from the HiveAnalysis and DataSourceAnalysis rules. Shall we unify that layer also? Let me know.
Thanks
…On Mon, 18 Mar 2019 at 4:33 PM, Wenchen Fan commented on sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala:

> @@ -206,6 +206,8 @@ case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[Logi
>   val analyzedQuery = query.get
>   val normalizedTable = normalizeCatalogTable(analyzedQuery.schema, tableDesc)
> + DDLUtils.checkDataColNames(tableDesc.copy(schema = analyzedQuery.schema))

did we call this in the else branch?
|
|
Yea let's unify that as well |
sure |
|
Retest this please. |
|
@cloud-fan @dongjoon-hyun Thanks for your valuable time and guidance :) |
|
Test build #103617 has finished for PR 24075 at commit
|
|
Ur, @sujith71955 |
Will check and update the PR. thanks |
Issue got resolved. We cannot remove the column name validation logic from the RelationConversion rule, as this rule runs before 'PreprocessTableCreation'. |
|
Test build #103660 has finished for PR 24075 at commit
|
|
@cloud-fan @dongjoon-hyun Test cases passed now; let me know of any clarifications/suggestions. Thanks |
|
thanks, merging to master! |