Conversation

@opensky142857 commented May 26, 2021

### What changes were proposed in this pull request?

When converting to a HiveTable, respect the case of the table schema's column names.

### Why are the changes needed?

When a user creates a Hive bucketed table with an upper-case schema, the table schema is stored in lower case while the bucket column info keeps the user's original casing.

If we then try to insert into this table, a HiveException reports that the bucket column is not part of the table schema.

Here is a simple repro:

spark.sql("""
  CREATE TABLE TEST1(
    V1 BIGINT,
    S1 INT)
  PARTITIONED BY (PK BIGINT)
  CLUSTERED BY (V1)
  SORTED BY (S1)
  INTO 200 BUCKETS
  STORED AS PARQUET """).show

spark.sql("INSERT INTO TEST1 SELECT * FROM VALUES(1,1,1)").show

Error message:

```
scala> spark.sql("INSERT INTO TEST1 SELECT * FROM VALUES(1,1,1)").show
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), FieldSchema(name:s1, type:int, comment:null)]
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
  at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1242)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1166)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:103)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
  ... 47 elided
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), FieldSchema(name:s1, type:int, comment:null)]
  at org.apache.hadoop.hive.ql.metadata.Table.setBucketCols(Table.java:552)
  at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1082)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitions$1(HiveClientImpl.scala:732)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
  at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitions(HiveClientImpl.scala:731)
  at org.apache.spark.sql.hive.client.HiveClient.getPartitions(HiveClient.scala:222)
  at org.apache.spark.sql.hive.client.HiveClient.getPartitions$(HiveClient.scala:218)
  at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitions(HiveClientImpl.scala:91)
  at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1245)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
  ... 69 more
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

@opensky142857 opensky142857 changed the title [SPARK-35531]Can not insert into hive bucket table if create table wi… [SPARK-35531][SQL]Can not insert into hive bucket table if create table wi… May 26, 2021
@HyukjinKwon HyukjinKwon changed the title [SPARK-35531][SQL]Can not insert into hive bucket table if create table wi… [SPARK-35531][SQL] Can not insert into hive bucket table if create table with upper case schema May 26, 2021
@AmplabJenkins

Can one of the admins verify this patch?

@github-actions github-actions bot added the SQL label May 26, 2021
@wangyum (Member) commented May 28, 2021

@cloud-fan (Contributor):

Can you post the full stacktrace? I'm a bit curious about how/where the error happens.

@wangyum (Member) commented May 31, 2021

@cloud-fan I have added the stacktrace to the PR description.

Review thread on `HiveClientImpl.toHiveTable`:

```scala
table.bucketSpec match {
  case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
    hiveTable.setNumBuckets(bucketSpec.numBuckets)
    hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
```
Contributor:

To clarify: hiveTable.setFields lower-cases the column names, but hiveTable.setBucketCols does not. And this causes the exception?

Author:

Yes.
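
To make the mismatch concrete, here is a minimal sketch against Hive's `org.apache.hadoop.hive.ql.metadata.Table` API (hedged: the lower-cased field names below reflect what the raw metastore table carries in this scenario, not Spark's exact call sequence):

```scala
import org.apache.hadoop.hive.metastore.api.FieldSchema
import org.apache.hadoop.hive.ql.metadata.Table

val hiveTable = new Table("default", "test1")
// The raw metastore table carries lower-cased column names.
hiveTable.setFields(java.util.Arrays.asList(
  new FieldSchema("v1", "bigint", null),
  new FieldSchema("s1", "int", null)))
// The bucket spec still has the user's original casing, so this throws
// HiveException: Bucket columns V1 is not part of the table columns ...
hiveTable.setBucketCols(java.util.Arrays.asList("V1"))
```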

@cloud-fan (Contributor):

> the table schema is stored in lower case while the bucket column info keeps the user's original casing.

I'm not sure this is true. Table schema and bucketed columns are both stored in the Spark-specific table properties, which are case-preserving.

@opensky142857 (Author):

> > the table schema is stored in lower case while the bucket column info keeps the user's original casing.
>
> I'm not sure this is true. Table schema and bucketed columns are both stored in the Spark-specific table properties, which are case-preserving.

The schema here comes from the metastore when we call methods like `getRawTableOption`, and the Spark schema info in the table properties does not overwrite the schema field in those cases.
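
The PR's fix (see the review hunk just below) routes the bucket/sort column names through a helper that restores their casing from the table schema. A hedged sketch of what `restoreHiveBucketSpecColNames` might look like, assuming a case-insensitive lookup; the actual implementation in the diff may differ:

```scala
import org.apache.spark.sql.types.StructType

// Sketch only: map each bucket/sort column to the matching schema field's
// name, compared case-insensitively, falling back to the original name.
def restoreHiveBucketSpecColNames(
    schema: StructType,
    colNames: Seq[String]): Seq[String] = {
  colNames.map { col =>
    schema.fields
      .find(_.name.equalsIgnoreCase(col))
      .map(_.name)
      .getOrElse(col)
  }
}
```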

Review thread on `HiveClientImpl.toHiveTable`:

```diff
 if (bucketSpec.sortColumnNames.nonEmpty) {
   hiveTable.setSortCols(
-    bucketSpec.sortColumnNames
+    restoreHiveBucketSpecColNames(table.schema, bucketSpec.sortColumnNames)
```
Contributor:

Sorry, I still can't understand how the bug happens.

In this toHiveTable method, the input CatalogTable should guarantee sanity: the partition/bucket column names should match the schema.

Author:

```scala
// org/apache/spark/sql/hive/client/HiveClient.scala
final def getPartitions(
    db: String,
    table: String,
    partialSpec: Option[TablePartitionSpec]): Seq[CatalogTablePartition] = {
  getPartitions(getTable(db, table), partialSpec)
}
```

In this method we get the table from the metastore and pass it into getPartitions, which then calls toHiveTable to convert the resulting CatalogTable back into a HiveTable.

@cloud-fan (Contributor) commented Jun 11, 2021:

So there is an unnecessary HiveTable -> CatalogTable -> HiveTable conversion?

Author:

This conversion is necessary, since the Hive client interface requires a CatalogTable.

Contributor:

Can we change the Hive client interface? For example:

```scala
trait HiveClient {

  def getRawTableOption(dbName: String, tableName: String): Option[HiveTable]

  final def getTableOption(dbName: String, tableName: String): Option[CatalogTable] = {
    getRawTableOption(dbName, tableName).map(convertHiveTableToCatalogTable)
  }
  ...
  def getPartitionOption(
      table: HiveTable,
      spec: TablePartitionSpec): Option[CatalogTablePartition] = {
    getPartitions(getRawTable(db, table), partialSpec)
  }
  ...
}
```
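
The point of the sketch above is that partition listing could then stay on the raw Hive table end to end. A hedged usage sketch (both `getRawTable` and a raw-table `getPartitions` overload are assumptions extrapolated from the proposal, not existing Spark API):

```scala
// Hypothetical: stay on the raw HiveTable end to end, skipping the lossy
// raw -> CatalogTable -> raw round trip that lower-cases the schema.
def listPartitionsRaw(
    client: HiveClient,
    db: String,
    table: String,
    partialSpec: Option[TablePartitionSpec]): Seq[CatalogTablePartition] = {
  client.getPartitions(client.getRawTable(db, table), partialSpec)
}
```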

Author:

This is not the only place that has this issue. If we set spark.sql.statistics.size.autoUpdate.enabled=true, the same issue shows up as well: for ALTER TABLE we have to do a CatalogTable -> HiveTable conversion, as sketched below.
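
A minimal sketch of that second trigger, reusing the TEST1 table from the repro above (hedged: with size auto-update on, a successful write is followed by a stats update that performs the same CatalogTable -> HiveTable conversion):

```scala
// After a write, Spark updates the table's size statistics, which goes
// through HiveExternalCatalog.alterTableStats and re-converts the
// CatalogTable into a HiveTable, hitting the same bucket-column mismatch.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")
spark.sql("INSERT INTO TEST1 SELECT * FROM VALUES(1, 1, 1)")
```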

Contributor:

CatalogTable -> HiveTable is fine, as long as the CatalogTable is correctly initialized. The problem I see here is that we get the CatalogTable via HiveClient.getTable, which doesn't go through the initialization logic in HiveExternalCatalog.

Contributor:

@AngersZhuuuu can you take this over?

Contributor:

> @AngersZhuuuu can you take this over?

Sure.

github-actions bot commented Oct 3, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 3, 2021
@github-actions github-actions bot closed this Oct 4, 2021
cloud-fan pushed a commit that referenced this pull request Nov 22, 2022
### What changes were proposed in this pull request?

Update Hive table stats without converting HiveTable -> CatalogTable -> HiveTable.
`HiveExternalCatalog.alterTableStats()` converts the raw HiveTable to a CatalogTable, which stores the schema in lower case but keeps the bucket columns as they are.
`HiveClientImpl.alterTable()` then throws a `Bucket columns V1 is not part of the table columns` exception when re-converting that CatalogTable to a HiveTable.

### Why are the changes needed?

Bug fix, refer to #32675 (comment)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Update existing UT

Closes #38495 from wankunde/write_stats_directly.

Authored-by: Kun Wan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
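
A hedged sketch of the idea behind that follow-up (illustrative only; the helper name below is hypothetical, not the actual #38495 API): write the new stats properties onto the raw `org.apache.hadoop.hive.ql.metadata.Table` and persist that object directly, so the schema and bucket columns never round-trip through a CatalogTable.

```scala
import org.apache.hadoop.hive.ql.metadata.Table

// Hypothetical helper: apply updated stats parameters to the raw Hive table.
// The caller persists `rawTable` via the Hive client's alterTable, so the
// lower-cased schema and original-case bucket columns are left untouched.
def applyStats(rawTable: Table, newStats: Map[String, String]): Table = {
  newStats.foreach { case (k, v) => rawTable.setProperty(k, v) }
  rawTable
}
```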
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022