[SPARK-35531][SQL] Can not insert into hive bucket table if create table with upper case schema #32675
Conversation
Can one of the admins verify this patch?
Resolved (now outdated) review comments on:
- sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
- sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala
Can you post the full stacktrace? I'm a bit curious about how/where the error happens.

@cloud-fan I have added the stacktrace to the PR description.
```scala
table.bucketSpec match {
  case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
    hiveTable.setNumBuckets(bucketSpec.numBuckets)
    hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
```
To clarify: hiveTable.setFields lower-cases the column names, but hiveTable.setBucketCols does not. And this causes the exception?
yes
I'm not sure this is true. Table schema and bucketed columns are both stored in the Spark-specific table properties, which are case-preserving.

The schema here comes from the metastore when we call methods like `getRawTableOption`, and the Spark schema info in the properties doesn't overwrite the schema field in those cases.
```diff
 if (bucketSpec.sortColumnNames.nonEmpty) {
   hiveTable.setSortCols(
-    bucketSpec.sortColumnNames
+    restoreHiveBucketSpecColNames(table.schema, bucketSpec.sortColumnNames)
```
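The body of `restoreHiveBucketSpecColNames` isn't shown in this excerpt. A minimal sketch of what such a helper could look like, assuming it simply realigns each bucket/sort column name with the casing used in the table schema (the implementation below is a guess, not the actual patch):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: match each bucket/sort column case-insensitively
// against the schema fields and return the schema's casing.
private def restoreHiveBucketSpecColNames(
    schema: StructType,
    colNames: Seq[String]): Seq[String] = {
  colNames.map { col =>
    schema.fields
      .find(_.name.equalsIgnoreCase(col))
      .map(_.name)      // prefer the casing stored in the schema
      .getOrElse(col)   // fall back to the original name if nothing matches
  }
}
```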
Sorry I still can't understand how the bug happens.
In this toHiveTable method, the input CatalogTable should guarantee sanity: the partition/bucket column names should match the schema.
From org/apache/spark/sql/hive/client/HiveClient.scala:

```scala
final def getPartitions(
    db: String,
    table: String,
    partialSpec: Option[TablePartitionSpec]): Seq[CatalogTablePartition] = {
  getPartitions(getTable(db, table), partialSpec)
}
```
In this method, we get the table from the metastore, pass it into `getPartitions`, and then call `toHiveTable` to convert the CatalogTable back into a HiveTable.
So there is an unnecessary HiveTable -> CatalogTable -> HiveTable conversion?
The conversion is not unnecessary, since the Hive client interface requires a CatalogTable.
Can we change the hive client interface? For example:

```scala
trait HiveClient {
  def getRawTableOption(dbName: String, tableName: String): Option[HiveTable]

  final def getTableOption(dbName: String, tableName: String): Option[CatalogTable] = {
    getRawTableOption(dbName, tableName).map(convertHiveTableToCatalogTable)
  }
  ...
  def getPartitionOption(
      table: HiveTable,
      spec: TablePartitionSpec): Option[CatalogTablePartition]
  ...
}
```
This is not the only place that has this issue. If we set `spark.sql.statistics.size.autoUpdate.enabled=true`, you can see it as well: for ALTER TABLE, we have to do a CatalogTable -> HiveTable conversion.
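For illustration, a minimal way to hit that path, assuming a bucketed Hive table `t` with upper-case columns already exists (table name and values are hypothetical):

```scala
// With size auto-update enabled, a successful write triggers
// HiveExternalCatalog.alterTableStats(), which round-trips
// CatalogTable -> HiveTable and hits the same casing mismatch.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")
spark.sql("INSERT INTO t VALUES (1, 2)")
```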
CatalogTable -> HiveTable is fine, as long as the CatalogTable is correctly initialized. The problem I see here is that we get the CatalogTable via `HiveClient.getTable`, which doesn't go through the initialization logic in `HiveExternalCatalog`.
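To make the distinction concrete, a rough sketch contrasting the two ways of obtaining a CatalogTable (illustrative only; the two `getTable` signatures are real, the surrounding function is not from the patch):

```scala
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, ExternalCatalog}
import org.apache.spark.sql.hive.client.HiveClient

def showTwoPaths(externalCatalog: ExternalCatalog, hiveClient: HiveClient): Unit = {
  // Path 1: HiveExternalCatalog.getTable restores the case-preserving Spark
  // schema from the table properties, so the CatalogTable is sane.
  val restored: CatalogTable = externalCatalog.getTable("db", "t")

  // Path 2: HiveClient.getTable returns the raw metastore view: the schema is
  // lower-cased while the bucket columns keep the user's original casing.
  val raw: CatalogTable = hiveClient.getTable("db", "t")
}
```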
@AngersZhuuuu can you take this over?
> @AngersZhuuuu can you take this over?

Sure.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
### What changes were proposed in this pull request?

Update Hive table stats without converting from HiveTable -> CatalogTable -> HiveTable. `HiveExternalCatalog.alterTableStats()` converts a raw HiveTable to a CatalogTable, which stores the schema as lowercase while keeping the bucket columns as they are. `HiveClientImpl.alterTable()` then throws a `Bucket columns V1 is not part of the table columns` exception when re-converting the CatalogTable to a HiveTable.

### Why are the changes needed?

Bug fix, refer to #32675 (comment)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Updated existing UT

Closes #38495 from wankunde/write_stats_directly.

Authored-by: Kun Wan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

When converting a CatalogTable to a HiveTable, respect the casing of the table schema columns.
### Why are the changes needed?

When a user creates a Hive bucket table with an upper-case schema, the table schema is stored in lower case, while the bucket column info keeps the user's original casing. If we try to insert into this table, a HiveException reports that the bucket column is not part of the table columns.

Here is a simple repro:
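The original repro code block did not survive extraction; a minimal sketch matching the description (table name and values are hypothetical, and the enforce settings may be needed so Spark will write into a Hive bucketed table at all):

```scala
// Allow Spark to write into a Hive bucketed table.
spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")

// Bucketed Hive table created with upper-case column names: the metastore
// lower-cases the schema (v1, v2) but the bucket spec keeps V1 as typed.
spark.sql("CREATE TABLE t (V1 INT, V2 INT) CLUSTERED BY (V1) INTO 8 BUCKETS STORED AS ORC")

// Fails while converting the CatalogTable back to a HiveTable.
spark.sql("INSERT INTO t VALUES (1, 2)")
```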
Error message:
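The full stacktrace was also lost in extraction; per the discussion and the commit message above, the key line is likely of the form:

```
org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not part of the table columns
```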
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT