
[SPARK-26176][SQL] Verify column names for CTAS with STORED AS #24075

Closed
sujith71955 wants to merge 4 commits into apache:master from sujith71955:master_serde

Conversation

Contributor

@sujith71955 sujith71955 commented Mar 12, 2019

What changes were proposed in this pull request?

Currently, users meet job abortions while creating a table using the Hive serde "STORED AS" with invalid column names. We had better prevent this by raising AnalysisException with a guide to use aliases instead, like Parquet data source tables, thus making the error message consistent with the one shown while creating native Parquet/ORC tables.

BEFORE

```scala
scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
scala> sql("CREATE TABLE a STORED AS PARQUET AS SELECT 1 AS `COUNT(ID)`")
Caused by: java.lang.IllegalArgumentException: No enum constant parquet.schema.OriginalType.col1
```

AFTER

```scala
scala> sql("CREATE TABLE a STORED AS PARQUET AS SELECT 1 AS `COUNT(ID)`")
org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
```

How was this patch tested?

Pass the Jenkins with the newly added test case.

@sujith71955 sujith71955 changed the title [SPARK-26176][SQL] Invalid column names validation is been added when we create a table using the Hive serde "STORED AS [SPARK-26176][SQL] Invalid column names validation is been added when we create a table using the Hive serde "STORED AS" Mar 12, 2019
@SparkQA

SparkQA commented Mar 13, 2019

Test build #103390 has finished for PR 24075 at commit e38e282.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

## What changes were proposed in this pull request?

Currently, users meet job abortions while creating a table using the Hive serde "STORED AS" with invalid column names. We had better prevent this by raising **AnalysisException** with a guide to use aliases instead, like Parquet data source tables, thus making the error message consistent with the one shown while creating native Parquet/ORC tables.

**BEFORE**

```scala
scala> sql("CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1")

Caused by: java.lang.IllegalArgumentException: No enum constant parquet.schema.OriginalType.col1
	at java.lang.Enum.valueOf(Enum.java:238)
	at parquet.schema.OriginalType.valueOf(OriginalType.java:21)
	at parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:160)
	at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:111)
	at parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:99)
	at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:92)
	at parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:82)
	at
```

**AFTER**

```scala
scala> sql("CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1")

2019-03-13 03:17:58 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:58)
```

## How was this patch tested?

Pass the Jenkins with a new test case.
@sujith71955
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 13, 2019

Test build #103441 has finished for PR 24075 at commit 6d162fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955
Contributor Author

Test build #103390 has finished for PR 24075 at commit e38e282.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Seems to be a random failure.

@sujith71955
Contributor Author

retest this please

@sujith71955
Contributor Author

The above failure is not related to this PR. Please review and let me know if you have any suggestions. Thanks
cc @HyukjinKwon @dongjoon-hyun @cloud-fan

@SparkQA

SparkQA commented Mar 14, 2019

Test build #103473 has started for PR 24075 at commit 6d162fc.

@sujith71955
Contributor Author

cc @gatorsmile


```scala
case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) =>
  DDLUtils.checkDataColNames(tableDesc)
  DDLUtils.checkDataColNames(tableDesc.copy(schema = query.schema))
```
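For reference, the validation invoked above rejects the same characters that Parquet's schema converter rejects. A minimal, Spark-free sketch of that check — the character set and message text are taken from the error quoted in this PR, while `invalidChars` and `checkColumnName` are hypothetical names, not Spark's actual `DDLUtils.checkDataColNames` implementation:

```scala
// Hypothetical sketch of the column-name validation, assuming the
// character set " ,;{}()\n\t=" quoted in the PR's error message.
val invalidChars: Set[Char] = " ,;{}()\n\t=".toSet

// Throws for names such as "count(ID)"; passes for plain identifiers.
def checkColumnName(name: String): Unit = {
  if (name.exists(invalidChars.contains)) {
    throw new IllegalArgumentException(
      s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=". """ +
        "Please use alias to rename it.")
  }
}
```

This is why `SELECT 1 AS cnt` succeeds where `SELECT COUNT(ID)` fails: the alias produces a plain identifier with none of the rejected characters.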
Contributor

How is it done for data source tables? by another rule?

Contributor Author

Yes, in the data source table scenario the flow goes through the DataSourceAnalysis rule.

Contributor Author

Moreover, one more problem I observed: the serde class name defined in the checkDataColNames() API is "parquet.hive.serde.ParquetHiveSerDe", but the serde name for Parquet shall be "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", so I added this serde as well in the above code as part of the checkDataColNames() API.
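In other words, the check should fire for either Parquet serde class name. A small sketch of that matching, assuming a hypothetical helper name (`isParquetSerde`); the two class names themselves come from this discussion:

```scala
// Both Parquet serde class names mentioned above; the helper is a
// hypothetical stand-in for the lookup inside checkDataColNames().
val parquetSerdes: Set[String] = Set(
  "parquet.hive.serde.ParquetHiveSerDe",
  "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")

def isParquetSerde(serde: Option[String]): Boolean =
  serde.exists(parquetSerdes.contains)
```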

Contributor

can we unify this check for both data source table and hive serde table?

Contributor Author

sure, let me check. thanks for your input.

Contributor Author

Both are called from different rules; I will check how to unify them.

Contributor Author

Handled as part of the PreprocessTableCreation rule for the CTAS query; please review and let me know for any suggestions. Thanks

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 15, 2019

Thank you for pinging me, @sujith71955 .

  • I updated the PR description slightly and triggered a new test run since there was no successful run until now.
  • In addition, I updated the JIRA to an Improvement since the previous and new behavior are the same except for raising a better exception for UX.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-26176][SQL] Invalid column names validation is been added when we create a table using the Hive serde "STORED AS" [SPARK-26176][SQL] Check invalid column names for CTAS with STORED AS Mar 15, 2019
@dongjoon-hyun
Member

Retest this please.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-26176][SQL] Check invalid column names for CTAS with STORED AS [SPARK-26176][SQL] Verify column names for CTAS with STORED AS Mar 15, 2019
@sujith71955
Contributor Author

Thank you for pinging me, @sujith71955 .

  • I updated the PR description slightly and triggered a new test run since there was no successful run until now.
  • In addition, I updated the JIRA to an Improvement since the previous and new behavior are the same except for raising a better exception for UX.

Sure. Thanks :)

@SparkQA

SparkQA commented Mar 15, 2019

Test build #103524 has finished for PR 24075 at commit 6d162fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 16, 2019

Test build #103568 has finished for PR 24075 at commit b199894.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955
Contributor Author

Test build #103568 has finished for PR 24075 at commit b199894.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

The failures seem unrelated to the PR.

@sujith71955
Contributor Author

retest this please

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Mar 17, 2019

Test build #103585 has finished for PR 24075 at commit b199894.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955
Contributor Author

Gentle ping @dongjoon-hyun @cloud-fan

```scala
val analyzedQuery = query.get
val normalizedTable = normalizeCatalogTable(analyzedQuery.schema, tableDesc)

DDLUtils.checkDataColNames(tableDesc.copy(schema = analyzedQuery.schema))
```

This comment was marked as resolved.

```scala
if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    isConvertible(tableDesc) && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_CTAS) =>
  DDLUtils.checkDataColNames(tableDesc)
  DDLUtils.checkDataColNames(tableDesc.copy(schema = query.schema))
```
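For readers following the guard above: it only applies when the table is a Hive table, is unpartitioned, has a convertible serde, and the CTAS conversion flag is on. A hypothetical, simplified stand-in with the inputs modeled as plain values (the real rule pattern-matches on CreateTable and consults SQLConf; all names here are illustrative):

```scala
// Simplified model of the guard's inputs; not Spark's actual types.
final case class TableDescModel(
    isHive: Boolean,
    partitionColumnNames: Seq[String],
    convertibleSerde: Boolean)

// True only when every condition in the guard holds.
def shouldConvertCtas(t: TableDescModel, convertMetastoreCtas: Boolean): Boolean =
  t.isHive &&
    t.partitionColumnNames.isEmpty &&
    t.convertibleSerde &&
    convertMetastoreCtas
```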
Contributor

Do we need to call it here?

@sujith71955
Contributor Author

sujith71955 commented Mar 18, 2019 via email

@cloud-fan
Contributor

Yea let's unify that as well

@sujith71955
Contributor Author

Yea let's unify that as well

sure

@sujith71955
Contributor Author

Retest this please.

@sujith71955
Contributor Author

@cloud-fan @dongjoon-hyun Thanks for your valuable time and guidance :)

@SparkQA

SparkQA commented Mar 18, 2019

Test build #103617 has finished for PR 24075 at commit 02f2c19.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Ur, @sujith71955
The test failure is relevant to this PR. Could you check once more?

org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-21912 ORC/Parquet table should not create invalid column names

@sujith71955
Contributor Author

Ur, @sujith71955
The test failure is relevant to this PR. Could you check once more?

org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-21912 ORC/Parquet table should not create invalid column names

Will check and update the PR. thanks

@sujith71955
Contributor Author

Ur, @sujith71955
The test failure is relevant to this PR. Could you check once more?

org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-21912 ORC/Parquet table should not create invalid column names

Will check and update the PR. thanks

The issue is resolved. We cannot remove the column-name validation logic from the `RelationConversions` rule, as this rule runs before the `PreprocessTableCreation`, `PreprocessTableInsertion`, `DataSourceAnalysis` and `HiveAnalysis` rules; so, before converting the relation, we shall do the validation as per the existing logic.

@SparkQA

SparkQA commented Mar 19, 2019

Test build #103660 has finished for PR 24075 at commit 8da74a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955
Contributor Author

@cloud-fan @dongjoon-hyun Test cases pass now; let me know if you have any clarifications/suggestions. Thanks

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e402de5 Mar 19, 2019