[SPARK-20680][SQL] Spark-sql do not support for creating table with void column datatype #28833
Changes from all commits
```diff
@@ -48,6 +48,7 @@ class ResolveSessionCatalog(
   override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
     case AlterTableAddColumnsStatement(
         nameParts @ SessionCatalogAndTable(catalog, tbl), cols) =>
+      cols.foreach(c => failNullType(c.dataType))
       loadTable(catalog, tbl.asIdentifier).collect {
         case v1Table: V1Table =>
           if (!DDLUtils.isHiveTable(v1Table.v1Table)) {
```
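`failNullType` is a helper introduced by this patch rather than an existing library call; only its name and usage appear in the diff. A minimal sketch of the recursive check it presumably performs, as it would sit inside `ResolveSessionCatalog` (the body and error message here are assumptions):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types._

// Sketch: reject a column type if it is NullType or contains NullType
// anywhere inside a nested array/map/struct.
private def failNullType(dt: DataType): Unit = {
  def containsNullType(dt: DataType): Boolean = dt match {
    case ArrayType(et, _) => containsNullType(et)
    case MapType(kt, vt, _) => containsNullType(kt) || containsNullType(vt)
    case StructType(fields) => fields.exists(f => containsNullType(f.dataType))
    case _ => dt.isInstanceOf[NullType]
  }
  if (containsNullType(dt)) {
    // Assumed message; the exact wording comes from this patch.
    throw new AnalysisException("Cannot create tables with unknown type.")
  }
}
```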
```diff
@@ -76,6 +77,7 @@ class ResolveSessionCatalog(
     case AlterTableReplaceColumnsStatement(
         nameParts @ SessionCatalogAndTable(catalog, tbl), cols) =>
+      cols.foreach(c => failNullType(c.dataType))
       val changes: Seq[TableChange] = loadTable(catalog, tbl.asIdentifier) match {
         case Some(_: V1Table) =>
           throw new AnalysisException("REPLACE COLUMNS is only supported with v2 tables.")
```
```diff
@@ -100,6 +102,7 @@ class ResolveSessionCatalog(
     case a @ AlterTableAlterColumnStatement(
         nameParts @ SessionCatalogAndTable(catalog, tbl), _, _, _, _, _) =>
+      a.dataType.foreach(failNullType)
       loadTable(catalog, tbl.asIdentifier).collect {
         case v1Table: V1Table =>
           if (!DDLUtils.isHiveTable(v1Table.v1Table)) {
```
```diff
@@ -268,6 +271,7 @@ class ResolveSessionCatalog(
     // session catalog and the table provider is not v2.
     case c @ CreateTableStatement(
         SessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _) =>
+      assertNoNullTypeInSchema(c.tableSchema)
```
Member

It would be great if we could add a legacy flag for such a behavior change in the future. This changes the behavior for both v1 and v2 catalogs in order to fix a compatibility issue with the Hive Metastore, but the Hive Metastore is not the only catalog Spark supports, since we have opened up the Catalog APIs in DSv2.
Contributor

I don't know of any database that supports creating tables with a null/void type column, so this change is not for Hive compatibility but for reasonable SQL semantics. I agree this is a breaking change that should at least be noted in the migration guide. A legacy config could also be added, but I can't find a reasonable use case for a null type column.

bart-samwel

I think the main reason you would want to support it is when people use tables / views / temp tables to structure existing workloads. We support NullType in CTEs, but when people want to reuse the same CTE across multiple queries (i.e., multi-output workloads), they have no choice but to use views or temporary tables. (With DataFrames they could still reuse the same DataFrame for multiple outputs, but in SQL that doesn't work.) One typical use case where you use CTEs to structure your code is when you have multiple sources with different structures that you then UNION ALL together into a single dataset. It is not uncommon for each source to have certain columns that don't apply, and then you write explicit NULLs there. It would be pretty annoying to have to write explicit casts of those NULLs to the right type in all of those cases — see the sketch below.
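To make that scenario concrete, here is a hypothetical multi-source UNION ALL inside a CTE, where each source lacks some columns and bare NULLs fill the gaps (table and column names are invented for illustration):

```scala
// Hypothetical illustration: bare NULLs start out as NullType, but union
// type coercion widens each one to the type of the matching column from
// the other branch, so the CTE itself resolves fine.
spark.sql("""
  WITH unified AS (
    SELECT user_id, referrer, NULL AS device_id FROM web_events
    UNION ALL
    SELECT user_id, NULL AS referrer, device_id FROM app_events
  )
  SELECT * FROM unified
""")
```

The pain point appears when a column is NULL in every branch: its type stays NullType, and materializing the result as a table or view then trips the new check unless the user adds explicit casts.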
Contributor

@bart-samwel this makes sense, shall we also support …

bart-samwel

I think the …
Contributor

@LantaoJin do you have time to fix it? I think we can simply remove the null type check and add a few tests with both the in-memory and Hive catalogs.
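A hedged sketch of the kind of test being asked for, assuming a SharedSparkSession-style suite with Hive support enabled (table name and expected message are assumptions):

```scala
// Sketch only: with this patch the statements fail analysis; under the
// proposed follow-up (check removed) they would be expected to succeed.
Seq("parquet", "hive").foreach { provider =>
  val e = intercept[AnalysisException] {
    sql(s"CREATE TABLE t_null USING $provider AS SELECT null AS null_col")
  }
  assert(e.getMessage.contains("unknown type"))
}
```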
```diff
       val provider = c.provider.getOrElse(conf.defaultDataSourceName)
       if (!isV2Provider(provider)) {
         if (!DDLUtils.isHiveTable(Some(provider))) {
```
```diff
@@ -292,6 +296,9 @@ class ResolveSessionCatalog(
     case c @ CreateTableAsSelectStatement(
         SessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _, _) =>
+      if (c.asSelect.resolved) {
+        assertNoNullTypeInSchema(c.asSelect.schema)
+      }
       val provider = c.provider.getOrElse(conf.defaultDataSourceName)
       if (!isV2Provider(provider)) {
         val tableDesc = buildCatalogTable(tbl.asTableIdentifier, new StructType,
```
```diff
@@ -319,6 +326,7 @@ class ResolveSessionCatalog(
     // session catalog and the table provider is not v2.
     case c @ ReplaceTableStatement(
         SessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _) =>
+      assertNoNullTypeInSchema(c.tableSchema)
       val provider = c.provider.getOrElse(conf.defaultDataSourceName)
       if (!isV2Provider(provider)) {
         throw new AnalysisException("REPLACE TABLE is only supported with v2 tables.")
```
```diff
@@ -336,6 +344,9 @@ class ResolveSessionCatalog(
     case c @ ReplaceTableAsSelectStatement(
         SessionCatalogAndTable(catalog, tbl), _, _, _, _, _, _, _, _, _, _) =>
+      if (c.asSelect.resolved) {
+        assertNoNullTypeInSchema(c.asSelect.schema)
+      }
       val provider = c.provider.getOrElse(conf.defaultDataSourceName)
       if (!isV2Provider(provider)) {
         throw new AnalysisException("REPLACE TABLE AS SELECT is only supported with v2 tables.")
```
```diff
@@ -25,6 +25,7 @@ import org.apache.spark.sql.catalyst.catalog._
 import org.apache.spark.sql.catalyst.expressions.{Expression, InputFileBlockLength, InputFileBlockStart, InputFileName, RowOrdering}
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.connector.catalog.CatalogV2Util.assertNoNullTypeInSchema
 import org.apache.spark.sql.connector.expressions.{FieldReference, RewritableTransform}
 import org.apache.spark.sql.execution.command.DDLUtils
 import org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2
```
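The `CatalogV2Util.assertNoNullTypeInSchema` being imported here is the schema-level counterpart of the per-column `failNullType` guard. A minimal sketch of what it plausibly does, reconstructed as an assumption from its name and call sites and reusing the `failNullType` sketch from earlier:

```scala
import org.apache.spark.sql.types.StructType

// Sketch: run the recursive NullType check over every top-level field;
// failNullType also covers nested struct/array/map element types.
def assertNoNullTypeInSchema(schema: StructType): Unit = {
  schema.fields.foreach(f => failNullType(f.dataType))
}
```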
```diff
@@ -292,6 +293,8 @@ case class PreprocessTableCreation(sparkSession: SparkSession) extends Rule[LogicalPlan]
       "in the table definition of " + table.identifier,
       sparkSession.sessionState.conf.caseSensitiveAnalysis)

+    assertNoNullTypeInSchema(schema)
+
```
Contributor

Is this needed? I think the changes in …

Contributor
Author

Without this, `CREATE TABLE t1 USING PARQUET AS SELECT null as null_col` in Spark will throw … Comparing the error message from Hive: …

Contributor
Author

Sorry, the above description is incorrect. Without this, CTAS for a Hive table … It seems a Hive table (non-parquet/orc format) doesn't go through …
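If the thread is read correctly, the analyzer-side checks in `ResolveSessionCatalog` don't cover the Hive serde CTAS path, which is why the extra `assertNoNullTypeInSchema` call in `PreprocessTableCreation` is needed. A hypothetical pair of statements illustrating the two routes (table names invented; the second requires Hive support):

```scala
// Datasource CTAS: caught by the CreateTableAsSelectStatement check
// added to ResolveSessionCatalog.
spark.sql("CREATE TABLE t1 USING PARQUET AS SELECT null AS null_col")

// Hive serde CTAS: takes a different planning path, so it relies on the
// assertNoNullTypeInSchema call added to PreprocessTableCreation.
spark.sql("CREATE TABLE t2 STORED AS TEXTFILE AS SELECT null AS null_col")
```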
```diff
     val normalizedPartCols = normalizePartitionColumns(schema, table)
     val normalizedBucketSpec = normalizeBucketSpec(schema, table)
```