
Conversation

@wankunde (Contributor) commented Nov 3, 2022

What changes were proposed in this pull request?

Update Hive table stats directly, without the round-trip conversion HiveTable -> CatalogTable -> HiveTable.
`HiveExternalCatalog.alterTableStats()` converts the raw HiveTable to a CatalogTable, which stores the schema in lowercase but keeps the bucket columns as they are.
`HiveClientImpl.alterTable()` then throws a `Bucket columns V1 is not part of the table columns` exception when re-converting that CatalogTable to a HiveTable.
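
To make the failure concrete, here is a minimal self-contained Scala sketch (illustrative only, not Spark's actual classes): the schema comes back lowercased while the bucket columns keep their original case, so the Hive-side membership check fails.

// Illustrative stand-in for the relevant parts of CatalogTable.
case class CatalogTableSketch(columns: Seq[String], bucketColumns: Seq[String])

// Stand-in for the Hive-side validation in toHiveTable: every bucket column
// must be one of the table columns.
def toHiveTableSketch(t: CatalogTableSketch): Unit = {
  t.bucketColumns.foreach { col =>
    require(t.columns.contains(col),
      s"Bucket columns $col is not part of the table columns")
  }
}

// After the HiveTable -> CatalogTable conversion the schema is lowercased,
// but the bucket column keeps its original case:
val roundTripped = CatalogTableSketch(columns = Seq("v1"), bucketColumns = Seq("V1"))
// toHiveTableSketch(roundTripped)
// => java.lang.IllegalArgumentException: requirement failed:
//    Bucket columns V1 is not part of the table columns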

Why are the changes needed?

Bug fix; see #32675 (comment)

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated an existing UT.

github-actions bot added the SQL label Nov 3, 2022
wankunde changed the title from "[SPARK-35531][SQL] Update hive table stats without unnecessary convert" to "[WIP][SPARK-35531][SQL] Update hive table stats without unnecessary convert" on Nov 4, 2022
wankunde force-pushed the write_stats_directly branch from d616d85 to 963bca9 on November 4, 2022
wankunde changed the title from "[WIP][SPARK-35531][SQL] Update hive table stats without unnecessary convert" to "[SPARK-35531][SQL] Update hive table stats without unnecessary convert" on Nov 5, 2022
@wankunde (Contributor, Author) commented Nov 5, 2022

Retest this please

@wankunde (Contributor, Author) commented Nov 5, 2022

@cloud-fan @AngersZhuuuu Could you help review this PR? Another PR, #38496, depends on it.

Member commented:

Why did the stats change after this PR?

wankunde (Contributor, Author) replied:

Before this PR, when updating table stats, Spark converted the CatalogTable back into a Hive table.
After this PR, when updating table stats, Spark fetches the raw Hive table from the Hive metastore and updates the stats on it directly. See https://github.com/apache/spark/pull/38495/files#diff-45c9b065d76b237bcfecda83b8ee08c1ff6592d6f85acca09c0fa01472e056afR616-R618
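
For reference, a condensed sketch of the post-PR flow in HiveExternalCatalog.alterTableStats, pieced together from the snippets quoted later in this review (the exact merged code may differ):

override def alterTableStats(
    db: String,
    table: String,
    stats: Option[CatalogStatistics]): Unit = withClient {
  requireTableExists(db, table)
  // Fetch the raw Hive table once: no HiveTable -> CatalogTable -> HiveTable round-trip.
  val rawHiveTable = client.getRawHiveTable(db, table)
  // Old properties, minus any previously persisted Spark stats properties.
  val oldProps = client.hiveTableProps(rawHiveTable, containsStats = false)
    .filterKeys(!_.startsWith(STATISTICS_PREFIX))
  // Merge in the new stats (or just drop the old ones when stats is None).
  val newProps =
    if (stats.isDefined) oldProps ++ statsToProperties(stats.get) else oldProps
  client.alterTableProps(rawHiveTable, newProps.toMap)
}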

@cloud-fan (Contributor) commented Nov 14, 2022:

Suggested change:
-def alterTableStats(dbName: String, tableName: String, parameters: Map[String, String]): Unit
+def alterTableProps(dbName: String, tableName: String, newProps: Map[String, String]): Unit

wankunde (Contributor, Author) replied:

Done.

Contributor commented:

We can just call `getRawHiveTable`.

wankunde (Contributor, Author) replied:

Done.

Contributor commented:

What are we doing here?

wankunde (Contributor, Author) replied:

This reuses the existing UT to test the table-stats update code.

Contributor commented:

Does it test anything? It just invokes alterTableStats but does no verification.
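
For illustration, a hypothetical assertion (names assumed, not taken from the PR) that would make the test verify something:

// Read the table back through the catalog and check the persisted stats.
assert(externalCatalog.getTable(db, table).stats == Some(expectedStats))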

wankunde (Contributor, Author) replied:

Removed this change.

Contributor commented:

We can call `client.getRawHiveTable`; see 0942ea9.

wankunde (Contributor, Author) replied:

Done.

wankunde (Contributor, Author) replied:

If we call `client.getRawHiveTable` here, it throws `java.lang.LinkageError: loader constraint violation: loader (instance of sun/misc/Launcher$AppClassLoader) previously initiated loading for a different type with name "org/apache/hadoop/hive/ql/metadata/Table"` (the Hive Table class ends up defined by both the isolated Hive client classloader and the application classloader, which violates the loader constraint).

Detailed stack trace:

[info] org.apache.spark.sql.hive.execution.command.AlterTableDropPartitionSuite *** ABORTED *** (18 seconds, 552 milliseconds)
[info]   java.lang.LinkageError: loader constraint violation: loader (instance of sun/misc/Launcher$AppClassLoader) previously initiated loading for a different type with name "org/apache/hadoop/hive/ql/metadata/Table"
[info]   at java.lang.ClassLoader.defineClass1(Native Method)
[info]   at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
[info]   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
[info]   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
[info]   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
[info]   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
[info]   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
[info]   at java.security.AccessController.doPrivileged(Native Method)
[info]   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
[info]   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
[info]   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
[info]   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
[info]   at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1115)
[info]   at org.apache.spark.sql.hive.execution.V1WritesHiveUtils.getDynamicPartitionColumns(V1WritesHiveUtils.scala:51)
[info]   at org.apache.spark.sql.hive.execution.V1WritesHiveUtils.getDynamicPartitionColumns$(V1WritesHiveUtils.scala:43)
[info]   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.getDynamicPartitionColumns(InsertIntoHiveTable.scala:70)
[info]   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.partitionColumns$lzycompute(InsertIntoHiveTable.scala:80)
[info]   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.partitionColumns(InsertIntoHiveTable.scala:79)
[info]   at org.apache.spark.sql.execution.datasources.V1Writes$.org$apache$spark$sql$execution$datasources$V1Writes$$prepareQuery(V1Writes.scala:75)
[info]   at org.apache.spark.sql.execution.datasources.V1Writes$$anonfun$apply$1.applyOrElse(V1Writes.scala:57)
[info]   at org.apache.spark.sql.execution.datasources.V1Writes$$anonfun$apply$1.applyOrElse(V1Writes.scala:55)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
[info]   at org.apache.spark.sql.execution.datasources.V1Writes$.apply(V1Writes.scala:55)

Contributor commented:

This method should take RawHiveTable, so that we don't need to look up the table here.
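
Illustratively, the suggested shape would be (hypothetical signature, following the discussion above):

def alterTableProps(rawHiveTable: RawHiveTable, newProps: Map[String, String]): Unit

The caller already holds the RawHiveTable it fetched via `client.getRawHiveTable`, so the method can reuse it instead of performing a second metastore lookup.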

wankunde force-pushed the write_stats_directly branch from f69acca to 9108792 on November 15, 2022
@wankunde (Contributor, Author) commented:

Retest this please

Contributor commented:

It's a bit tricky to make HiveClient handle this STATISTICS_PREFIX. It should be the responsibility of HiveExternalCatalog; HiveClient should only take care of the communication with HMS.

wankunde force-pushed the write_stats_directly branch from acb8a95 to be0c869 on November 21, 2022
 // convert table statistics to properties so that we can persist them through hive client
-val statsProperties =
+val rawHiveTable = client.getRawHiveTable(db, table)
+val oldProps = client.hiveTableProps(rawHiveTable)
Contributor commented:

Can you explain the rationale?

wankunde force-pushed the write_stats_directly branch from 6486103 to 3115a62 on November 22, 2022
Comment on lines 725 to 733

 val oldProps =
   client.hiveTableProps(rawHiveTable, containsStats = false)
     .filterKeys(!_.startsWith(STATISTICS_PREFIX))
 val newProps =
   if (stats.isDefined) {
-    statsToProperties(stats.get)
+    oldProps ++ statsToProperties(stats.get)
   } else {
-    new mutable.HashMap[String, String]()
+    oldProps
   }
wankunde (Contributor, Author) replied:

@cloud-fan

  • Get all the old properties from the Hive table, excluding the Hive table stats (HiveStatisticsProperties is defined in HiveClientImpl, so this is done in HiveClientImpl).
  • Filter out the Spark SQL table and column stats properties (STATISTICS_PREFIX is defined in HiveExternalCatalog, so this is done in HiveExternalCatalog).
  • Add the new table stats and then save the table; the HiveClient side is sketched after this list.
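
A rough sketch of the HiveClientImpl side under this split, assuming Hive's thrift-level Table.setParameters API and the existing shim.alterTable call (the merged implementation may differ):

override def alterTableProps(
    rawHiveTable: RawHiveTable,
    newProps: Map[String, String]): Unit = withHiveState {
  import scala.collection.JavaConverters._
  val hiveTable = rawHiveTable.rawTable.asInstanceOf[HiveTable]
  // Replace the parameters wholesale so stale stats keys do not linger, then
  // push the raw Hive table to the metastore; no CatalogTable round-trip.
  hiveTable.getTTable.setParameters(newProps.asJava)
  shim.alterTable(client, s"${hiveTable.getDbName}.${hiveTable.getTableName}", hiveTable)
}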

def createTable(table: CatalogTable, ignoreIfExists: Boolean): Unit

/** Get hive table properties. */
def hiveTableProps(rawHiveTable: RawHiveTable, containsStats: Boolean): Map[String, String]
Contributor commented:

Shall we add a method in RawHiveTable to do it?

private class RawHiveTableImpl(override val rawTable: HiveTable) extends RawHiveTable {
  override lazy val toCatalogTable = convertHiveTableToCatalogTable(rawTable)

  override def hiveTableProps(containsStats: Boolean): Map[String, String] = {
Contributor commented:

Why do we need this parameter?

wankunde (Contributor, Author) replied:

I'm not sure whether we might need all of the Hive table properties somewhere else.

wankunde force-pushed the write_stats_directly branch from 73be616 to bad2444 on November 22, 2022
@cloud-fan (Contributor) commented:

Thanks, merging to master!

cloud-fan closed this in 2513368 on Nov 22, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022

Closes apache#38495 from wankunde/write_stats_directly.

Authored-by: Kun Wan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022