
[SPARK-26176][SQL] Verify column names for CTAS with STORED AS #24075

Closed
sujith71955 wants to merge 4 commits into apache:master from sujith71955:master_serde

Conversation

Contributor

@sujith71955 sujith71955 commented Mar 12, 2019

What changes were proposed in this pull request?

Currently, users meet job abortions while creating a table using the Hive serde "STORED AS" with invalid column names. We had better prevent this by raising AnalysisException with a guide to use aliases instead, like Parquet data source tables, thus making the error message consistent with the one shown while creating native Parquet/ORC tables.

BEFORE

```scala
scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
scala> sql("CREATE TABLE a STORED AS PARQUET AS SELECT 1 AS `COUNT(ID)`")
Caused by: java.lang.IllegalArgumentException: No enum constant parquet.schema.OriginalType.col1
```

AFTER

```scala
scala> sql("CREATE TABLE a STORED AS PARQUET AS SELECT 1 AS `COUNT(ID)`")
org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
```

How was this patch tested?

Pass the Jenkins with the newly added test case.

@sujith71955 sujith71955 changed the title [SPARK-26176][SQL] Invalid column names validation is been added when we create a table using the Hive serde "STORED AS [SPARK-26176][SQL] Invalid column names validation is been added when we create a table using the Hive serde "STORED AS" Mar 12, 2019
@SparkQA

SparkQA commented Mar 13, 2019

Test build #103390 has finished for PR 24075 at commit e38e282.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

## What changes were proposed in this pull request?

Currently, users meet job abortions while creating a table using the Hive serde "STORED AS" with invalid column names. We had better prevent this by raising **AnalysisException** with a guide to use aliases instead, like Parquet data source tables, thus making the error message consistent with the one shown while creating native Parquet/ORC tables.

**BEFORE**

```scala
scala> sql("CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1")

Caused by: java.lang.IllegalArgumentException: No enum constant parquet.schema.OriginalType.col1
	at java.lang.Enum.valueOf(Enum.java:238)
	at parquet.schema.OriginalType.valueOf(OriginalType.java:21)
	at parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:160)
	at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:111)
	at parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:99)
	at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:92)
	at parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:82)
	at
```

**AFTER**

```scala
scala> sql("CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1")

2019-03-13 03:17:58 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:58)
```

## How was this patch tested?

Pass the Jenkins with a new test case.
@sujith71955
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 13, 2019

Test build #103441 has finished for PR 24075 at commit 6d162fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955
Contributor Author

Test build #103390 has finished for PR 24075 at commit e38e282.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Seems to be a random failure.

@sujith71955
Contributor Author

retest this please

@sujith71955
Contributor Author

The above failure is not related to this PR. Please review and let me know if you have any suggestions. Thanks
cc @HyukjinKwon @dongjoon-hyun @cloud-fan

@SparkQA

SparkQA commented Mar 14, 2019

Test build #103473 has started for PR 24075 at commit 6d162fc.

@sujith71955
Contributor Author

cc @gatorsmile


```scala
case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) =>
  DDLUtils.checkDataColNames(tableDesc)
  DDLUtils.checkDataColNames(tableDesc.copy(schema = query.schema))
```
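For reference, the validation invoked above rejects the same characters that Parquet's schema converter rejects. A minimal, Spark-free sketch of that check — the character set and message text are taken from the error quoted in this PR, while `invalidChars` and `checkColumnName` are hypothetical names, not Spark's actual `DDLUtils.checkDataColNames` implementation:

```scala
// Hypothetical sketch of the column-name validation, assuming the
// character set " ,;{}()\n\t=" quoted in the PR's error message.
val invalidChars: Set[Char] = " ,;{}()\n\t=".toSet

// Throws for names such as "count(ID)"; passes for plain identifiers.
def checkColumnName(name: String): Unit = {
  if (name.exists(invalidChars.contains)) {
    throw new IllegalArgumentException(
      s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=". """ +
        "Please use alias to rename it.")
  }
}
```

This is why `SELECT 1 AS cnt` succeeds where `SELECT COUNT(ID)` fails: the alias produces a plain identifier with none of the rejected characters.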
Contributor

How is it done for data source tables? by another rule?

Contributor Author

Yes, in the data source table scenario the flow goes through the DataSourceAnalysis rule.

Contributor Author

Moreover, one more problem I observed: the serde class name defined in the checkDataColNames() API is "parquet.hive.serde.ParquetHiveSerDe", but the serde name for Parquet shall be "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe", so I added this serde as well in the above code as part of the checkDataColNames() API.
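In other words, the check should fire for either Parquet serde class name. A small sketch of that matching, assuming a hypothetical helper name (`isParquetSerde`); the two class names themselves come from this discussion:

```scala
// Both Parquet serde class names mentioned above; the helper is a
// hypothetical stand-in for the lookup inside checkDataColNames().
val parquetSerdes: Set[String] = Set(
  "parquet.hive.serde.ParquetHiveSerDe",
  "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")

def isParquetSerde(serde: Option[String]): Boolean =
  serde.exists(parquetSerdes.contains)
```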

Contributor

can we unify this check for both data source table and hive serde table?

Contributor Author

sure, let me check. thanks for your input.

Contributor Author

Both are called from different rules; I will check how to unify them.

Contributor Author

Handled as part of the PreprocessTableCreation rule for the CTAS query; please review and let me know for any suggestions. Thanks

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 15, 2019

Thank you for pinging me, @sujith71955 .

  • I updated the PR description slightly and triggered a new test run since there was no successful run until now.
  • In addition, I updated the JIRA to an Improvement since the previous and new behavior are the same except for raising a better exception for UX.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-26176][SQL] Invalid column names validation is been added when we create a table using the Hive serde "STORED AS" [SPARK-26176][SQL] Check invalid column names for CTAS with STORED AS Mar 15, 2019
@dongjoon-hyun
Member

Retest this please.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-26176][SQL] Check invalid column names for CTAS with STORED AS [SPARK-26176][SQL] Verify column names for CTAS with STORED AS Mar 15, 2019
@sujith71955
Contributor Author

Thank you for pinging me, @sujith71955 .

  • I updated the PR description slightly and triggered a new test run since there was no successful run until now.
  • In addition, I updated the JIRA to an Improvement since the previous and new behavior are the same except for raising a better exception for UX.

Sure. Thanks :)

@SparkQA

SparkQA commented Mar 15, 2019

Test build #103524 has finished for PR 24075 at commit 6d162fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 16, 2019

Test build #103568 has finished for PR 24075 at commit b199894.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955
Contributor Author

Test build #103568 has finished for PR 24075 at commit b199894.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

The failures seem unrelated to the PR.

@sujith71955
Contributor Author

retest this please

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Mar 17, 2019

Test build #103585 has finished for PR 24075 at commit b199894.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955
Contributor Author

Gentle ping @dongjoon-hyun @cloud-fan

```scala
val analyzedQuery = query.get
val normalizedTable = normalizeCatalogTable(analyzedQuery.schema, tableDesc)

DDLUtils.checkDataColNames(tableDesc.copy(schema = analyzedQuery.schema))
```

This comment was marked as resolved.

```scala
if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    isConvertible(tableDesc) && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_CTAS) =>
  DDLUtils.checkDataColNames(tableDesc)
  DDLUtils.checkDataColNames(tableDesc.copy(schema = query.schema))
```
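For readers following the guard above: it only applies when the table is a Hive table, is unpartitioned, has a convertible serde, and the CTAS conversion flag is on. A hypothetical, simplified stand-in with the inputs modeled as plain values (the real rule pattern-matches on CreateTable and consults SQLConf; all names here are illustrative):

```scala
// Simplified model of the guard's inputs; not Spark's actual types.
final case class TableDescModel(
    isHive: Boolean,
    partitionColumnNames: Seq[String],
    convertibleSerde: Boolean)

// True only when every condition in the guard holds.
def shouldConvertCtas(t: TableDescModel, convertMetastoreCtas: Boolean): Boolean =
  t.isHive &&
    t.partitionColumnNames.isEmpty &&
    t.convertibleSerde &&
    convertMetastoreCtas
```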
Contributor

Do we need to call it here?

@sujith71955
Contributor Author

sujith71955 commented Mar 18, 2019 via email

@cloud-fan
Contributor

Yea let's unify that as well

@sujith71955
Contributor Author

Yea let's unify that as well

sure

@sujith71955
Contributor Author

Retest this please.

@sujith71955
Contributor Author

@cloud-fan @dongjoon-hyun Thanks for your valuable time and guidance :)

@SparkQA

SparkQA commented Mar 18, 2019

Test build #103617 has finished for PR 24075 at commit 02f2c19.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Ur, @sujith71955
The test failure is relevant to this PR. Could you check once more?

org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-21912 ORC/Parquet table should not create invalid column names

@sujith71955
Contributor Author

Ur, @sujith71955
The test failure is relevant to this PR. Could you check once more?

org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-21912 ORC/Parquet table should not create invalid column names

Will check and update the PR. thanks

@sujith71955
Contributor Author

Ur, @sujith71955
The test failure is relevant to this PR. Could you check once more?

org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-21912 ORC/Parquet table should not create invalid column names

Will check and update the PR. thanks

The issue is resolved. We cannot remove the column-name validation logic from the `RelationConversions` rule, as this rule runs before the `PreprocessTableCreation`, `PreprocessTableInsertion`, `DataSourceAnalysis` and `HiveAnalysis` rules; so, before converting the relation, we shall do the validation as per the existing logic.

@SparkQA

SparkQA commented Mar 19, 2019

Test build #103660 has finished for PR 24075 at commit 8da74a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955
Contributor Author

@cloud-fan @dongjoon-hyun Test cases pass now; let me know if you have any clarifications/suggestions. Thanks

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e402de5 Mar 19, 2019