
Conversation

@wankunde (Contributor) commented Nov 3, 2022

What changes were proposed in this pull request?

Update Hive table stats directly, without the round-trip conversion HiveTable -> CatalogTable -> HiveTable.
`HiveExternalCatalog.alterTableStats()` converts the raw HiveTable to a CatalogTable, which stores the schema in lowercase but keeps the bucket columns as they are.
`HiveClientImpl.alterTable()` then throws a `Bucket columns V1 is not part of the table columns` exception when re-converting that CatalogTable to a HiveTable.
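
To make the failure concrete, here is a minimal self-contained Scala sketch (illustrative only, not Spark's actual classes): the schema comes back lowercased while the bucket columns keep their original case, so the Hive-side membership check fails.

// Illustrative stand-in for the relevant parts of CatalogTable.
case class CatalogTableSketch(columns: Seq[String], bucketColumns: Seq[String])

// Stand-in for the Hive-side validation in toHiveTable: every bucket column
// must be one of the table columns.
def toHiveTableSketch(t: CatalogTableSketch): Unit = {
  t.bucketColumns.foreach { col =>
    require(t.columns.contains(col),
      s"Bucket columns $col is not part of the table columns")
  }
}

// After the HiveTable -> CatalogTable conversion the schema is lowercased,
// but the bucket column keeps its original case:
val roundTripped = CatalogTableSketch(columns = Seq("v1"), bucketColumns = Seq("V1"))
// toHiveTableSketch(roundTripped)
// => java.lang.IllegalArgumentException: requirement failed:
//    Bucket columns V1 is not part of the table columns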

Why are the changes needed?

Bug fix; see #32675 (comment)

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated an existing UT.

github-actions bot added the SQL label Nov 3, 2022
wankunde changed the title from "[SPARK-35531][SQL] Update hive table stats without unnecessary convert" to "[WIP][SPARK-35531][SQL] Update hive table stats without unnecessary convert" on Nov 4, 2022
wankunde force-pushed the write_stats_directly branch from d616d85 to 963bca9 on November 4, 2022
wankunde changed the title from "[WIP][SPARK-35531][SQL] Update hive table stats without unnecessary convert" to "[SPARK-35531][SQL] Update hive table stats without unnecessary convert" on Nov 5, 2022
@wankunde (Contributor, Author) commented Nov 5, 2022

Retest this please

@wankunde (Contributor, Author) commented Nov 5, 2022

@cloud-fan @AngersZhuuuu Could you help review this PR? Another PR, #38496, depends on it.

Member commented:

Why did the stats change after this PR?

wankunde (Contributor, Author) replied:

Before this PR, when updating table stats, Spark converted the CatalogTable back into a Hive table.
After this PR, when updating table stats, Spark fetches the raw Hive table from the Hive metastore and updates the stats on it directly. See https://github.com/apache/spark/pull/38495/files#diff-45c9b065d76b237bcfecda83b8ee08c1ff6592d6f85acca09c0fa01472e056afR616-R618
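
For reference, a condensed sketch of the post-PR flow in HiveExternalCatalog.alterTableStats, pieced together from the snippets quoted later in this review (the exact merged code may differ):

override def alterTableStats(
    db: String,
    table: String,
    stats: Option[CatalogStatistics]): Unit = withClient {
  requireTableExists(db, table)
  // Fetch the raw Hive table once: no HiveTable -> CatalogTable -> HiveTable round-trip.
  val rawHiveTable = client.getRawHiveTable(db, table)
  // Old properties, minus any previously persisted Spark stats properties.
  val oldProps = client.hiveTableProps(rawHiveTable, containsStats = false)
    .filterKeys(!_.startsWith(STATISTICS_PREFIX))
  // Merge in the new stats (or just drop the old ones when stats is None).
  val newProps =
    if (stats.isDefined) oldProps ++ statsToProperties(stats.get) else oldProps
  client.alterTableProps(rawHiveTable, newProps.toMap)
}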

@cloud-fan (Contributor) commented Nov 14, 2022:

Suggested change:
-def alterTableStats(dbName: String, tableName: String, parameters: Map[String, String]): Unit
+def alterTableProps(dbName: String, tableName: String, newProps: Map[String, String]): Unit

wankunde (Contributor, Author) replied:

Done.

Contributor commented:

We can just call `getRawHiveTable`.

wankunde (Contributor, Author) replied:

Done.

Contributor commented:

What are we doing here?

wankunde (Contributor, Author) replied:

This reuses the existing UT to test the table-stats update code.

Contributor commented:

Does it test anything? It just invokes alterTableStats but does no verification.
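
For illustration, a hypothetical assertion (names assumed, not taken from the PR) that would make the test verify something:

// Read the table back through the catalog and check the persisted stats.
assert(externalCatalog.getTable(db, table).stats == Some(expectedStats))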

wankunde (Contributor, Author) replied:

Removed this change.

Contributor commented:

We can call `client.getRawHiveTable`; see 0942ea9.

wankunde (Contributor, Author) replied:

Done.

wankunde (Contributor, Author) replied:

If we call `client.getRawHiveTable` here, it throws `java.lang.LinkageError: loader constraint violation: loader (instance of sun/misc/Launcher$AppClassLoader) previously initiated loading for a different type with name "org/apache/hadoop/hive/ql/metadata/Table"` (the Hive Table class ends up defined by both the isolated Hive client classloader and the application classloader, which violates the loader constraint).

Detailed stack trace:

[info] org.apache.spark.sql.hive.execution.command.AlterTableDropPartitionSuite *** ABORTED *** (18 seconds, 552 milliseconds)
[info]   java.lang.LinkageError: loader constraint violation: loader (instance of sun/misc/Launcher$AppClassLoader) previously initiated loading for a different type with name "org/apache/hadoop/hive/ql/metadata/Table"
[info]   at java.lang.ClassLoader.defineClass1(Native Method)
[info]   at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
[info]   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
[info]   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
[info]   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
[info]   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
[info]   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
[info]   at java.security.AccessController.doPrivileged(Native Method)
[info]   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
[info]   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
[info]   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
[info]   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
[info]   at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1115)
[info]   at org.apache.spark.sql.hive.execution.V1WritesHiveUtils.getDynamicPartitionColumns(V1WritesHiveUtils.scala:51)
[info]   at org.apache.spark.sql.hive.execution.V1WritesHiveUtils.getDynamicPartitionColumns$(V1WritesHiveUtils.scala:43)
[info]   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.getDynamicPartitionColumns(InsertIntoHiveTable.scala:70)
[info]   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.partitionColumns$lzycompute(InsertIntoHiveTable.scala:80)
[info]   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.partitionColumns(InsertIntoHiveTable.scala:79)
[info]   at org.apache.spark.sql.execution.datasources.V1Writes$.org$apache$spark$sql$execution$datasources$V1Writes$$prepareQuery(V1Writes.scala:75)
[info]   at org.apache.spark.sql.execution.datasources.V1Writes$$anonfun$apply$1.applyOrElse(V1Writes.scala:57)
[info]   at org.apache.spark.sql.execution.datasources.V1Writes$$anonfun$apply$1.applyOrElse(V1Writes.scala:55)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
[info]   at org.apache.spark.sql.execution.datasources.V1Writes$.apply(V1Writes.scala:55)

Contributor commented:

This method should take RawHiveTable, so that we don't need to look up the table here.
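
Illustratively, the suggested shape would be (hypothetical signature, following the discussion above):

def alterTableProps(rawHiveTable: RawHiveTable, newProps: Map[String, String]): Unit

The caller already holds the RawHiveTable it fetched via `client.getRawHiveTable`, so the method can reuse it instead of performing a second metastore lookup.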

wankunde force-pushed the write_stats_directly branch from f69acca to 9108792 on November 15, 2022
@wankunde (Contributor, Author) commented:

Retest this please

Contributor commented:

It's a bit tricky to make HiveClient handle this STATISTICS_PREFIX. It should be the responsibility of HiveExternalCatalog; HiveClient should only take care of the communication with HMS.

wankunde force-pushed the write_stats_directly branch from acb8a95 to be0c869 on November 21, 2022
 // convert table statistics to properties so that we can persist them through hive client
-val statsProperties =
+val rawHiveTable = client.getRawHiveTable(db, table)
+val oldProps = client.hiveTableProps(rawHiveTable)
Contributor commented:

Can you explain the rationale?

wankunde force-pushed the write_stats_directly branch from 6486103 to 3115a62 on November 22, 2022
Comment on lines 725 to 733

 val oldProps =
   client.hiveTableProps(rawHiveTable, containsStats = false)
     .filterKeys(!_.startsWith(STATISTICS_PREFIX))
 val newProps =
   if (stats.isDefined) {
-    statsToProperties(stats.get)
+    oldProps ++ statsToProperties(stats.get)
   } else {
-    new mutable.HashMap[String, String]()
+    oldProps
   }
wankunde (Contributor, Author) replied:

@cloud-fan

  • Get all the old properties from the Hive table, excluding the Hive table stats (HiveStatisticsProperties is defined in HiveClientImpl, so this is done in HiveClientImpl).
  • Filter out the Spark SQL table and column stats properties (STATISTICS_PREFIX is defined in HiveExternalCatalog, so this is done in HiveExternalCatalog).
  • Add the new table stats and then save the table; the HiveClient side is sketched after this list.
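
A rough sketch of the HiveClientImpl side under this split, assuming Hive's thrift-level Table.setParameters API and the existing shim.alterTable call (the merged implementation may differ):

override def alterTableProps(
    rawHiveTable: RawHiveTable,
    newProps: Map[String, String]): Unit = withHiveState {
  import scala.collection.JavaConverters._
  val hiveTable = rawHiveTable.rawTable.asInstanceOf[HiveTable]
  // Replace the parameters wholesale so stale stats keys do not linger, then
  // push the raw Hive table to the metastore; no CatalogTable round-trip.
  hiveTable.getTTable.setParameters(newProps.asJava)
  shim.alterTable(client, s"${hiveTable.getDbName}.${hiveTable.getTableName}", hiveTable)
}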

def createTable(table: CatalogTable, ignoreIfExists: Boolean): Unit

/** Get hive table properties. */
def hiveTableProps(rawHiveTable: RawHiveTable, containsStats: Boolean): Map[String, String]
Contributor commented:

Shall we add a method in RawHiveTable to do it?

private class RawHiveTableImpl(override val rawTable: HiveTable) extends RawHiveTable {
  override lazy val toCatalogTable = convertHiveTableToCatalogTable(rawTable)

  override def hiveTableProps(containsStats: Boolean): Map[String, String] = {
Contributor commented:

Why do we need this parameter?

wankunde (Contributor, Author) replied:

I'm not sure whether we might need all of the Hive table properties somewhere else.

wankunde force-pushed the write_stats_directly branch from 73be616 to bad2444 on November 22, 2022
@cloud-fan (Contributor) commented:

Thanks, merging to master!

cloud-fan closed this in 2513368 on Nov 22, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022

Closes apache#38495 from wankunde/write_stats_directly.

Authored-by: Kun Wan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022