
Conversation

@wangyum (Member) commented May 21, 2022:

What changes were proposed in this pull request?

Rewrite the table location to an absolute location based on the database location. For example:

Table location: /user/hive/warehouse/db_spark_39203.db/t1
Database location: viewfs://clusterA/user/hive/warehouse/db_spark_39203.db/

The rewritten table location becomes: viewfs://clusterA/user/hive/warehouse/db_spark_39203.db/t1.
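A minimal sketch of the rewrite rule, assuming a toAbsoluteURI helper (the name matches the one discussed in the review below; the exact signature here is illustrative):

import java.net.URI

// A path without a scheme (e.g. one written by old Spark) inherits the
// scheme, host, and port of the absolute URI (here, the database location).
def toAbsoluteURI(uri: URI, absoluteUri: Option[URI]): URI = {
  if (!uri.isAbsolute && absoluteUri.isDefined) {
    val base = absoluteUri.get
    new URI(base.getScheme, base.getUserInfo, base.getHost, base.getPort,
      uri.getPath, uri.getQuery, uri.getFragment)
  } else {
    uri
  }
}

// toAbsoluteURI(
//   new URI("/user/hive/warehouse/db_spark_39203.db/t1"),
//   Some(new URI("viewfs://clusterA/user/hive/warehouse/db_spark_39203.db/")))
// => viewfs://clusterA/user/hive/warehouse/db_spark_39203.db/t1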

Why are the changes needed?

Old Spark versions (before SPARK-19257) did not store an absolute path when a table was created with the following SQL:

CREATE TABLE DB_SPARK_39203.t6(id int)
USING parquet
OPTIONS (
  path '/user/hive/warehouse/db_spark_39203.db/t1'
);

[screenshot]

This issue prevents Spark from reading tables across clusters. For example:
Hive, Spark 2.1, and HDFS 1 on cluster A.
Spark 3.0 and HDFS 2 on cluster B.

Other CREATE TABLE commands and the latest Spark do not have this issue:

CREATE TABLE DB_SPARK_39203.t7(id int) using parquet;

CREATE TABLE DB_SPARK_39203.t8(id int)
stored as parquet
LOCATION '/user/hive/warehouse/db_spark_39203.db/t1';

CREATE EXTERNAL TABLE DB_SPARK_39203.t9(id int)
stored as parquet
LOCATION '/user/hive/warehouse/db_spark_39203.db/t1';

[screenshot]

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests and production use: we have been running this patch for over two years.

@github-actions github-actions bot added the SQL label May 21, 2022
@wangyum wangyum force-pushed the SPARK-39203 branch 2 times, most recently from 65c1957 to 451e550 Compare May 24, 2022 13:45
@wangyum (Member Author) commented May 25, 2022:
@cloud-fan

@cloud-fan (Contributor):
table.storage.locationUri is the table location, not database location, isn't it?

@wangyum (Member Author):
The table location URI has already been rewritten by HiveClientImpl.convertHiveTableToCatalogTable.
For example:

table.storage.locationUri: Some(viewfs://clusterA/user/hive/warehouse/db_spark_39203.db/t1)
loc: /user/hive/warehouse/db_spark_39203.db/t1

@cloud-fan (Contributor):
Sorry I don't understand. The previous code completely ignores table.storage.locationUri. Do you mean we still don't care about table.storage.locationUri, but only want to get its qualifier (scheme, host, port, etc.)?

@wangyum (Member Author):
Yes. We rewrite table.storage.locationUri only once, in HiveClientImpl.convertHiveTableToCatalogTable. Then we use the rewritten URI to rewrite the other location URIs: CaseInsensitiveMap(table.storage.properties).get("path") and the partition locations.
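A hedged sketch of that second step, assuming the toAbsoluteURI helper sketched above and a CatalogTable named table in scope:

import org.apache.spark.sql.catalyst.catalog.CatalogUtils
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// The "path" storage property may be a bare path written by old Spark;
// qualify it against the already-rewritten table location URI.
val qualifiedPath = CaseInsensitiveMap(table.storage.properties).get("path").map { path =>
  toAbsoluteURI(CatalogUtils.stringToURI(path), table.storage.locationUri)
}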

@cloud-fan (Contributor):
It seems to me that it would be clearer to get the database location in HiveExternalCatalog.restoreDataSourceTable and not change HiveClientImpl at all.

@wangyum (Member Author) commented May 27, 2022:
That can't handle all cases. For example, rawTable's location is still incorrect here:

val rawTable = getRawTable(db, table)
val catalogTable = restoreTableMetadata(rawTable)
val partColNameMap = buildLowerCasePartColNameMap(catalogTable)
val clientPrunedPartitions =
  client.getPartitionsByFilter(rawTable, predicates).map { part =>
    part.copy(spec = restorePartitionSpec(part.spec, partColNameMap))
  }

override def getPartitionsByFilter(
    table: CatalogTable,
    predicates: Seq[Expression]): Seq[CatalogTablePartition] = withHiveState {
  val hiveTable = toHiveTable(table, Some(userName))
  val parts = shim.getPartitionsByFilter(client, hiveTable, predicates, table)
    .map(fromHivePartition)
  HiveCatalogMetrics.incrementFetchedPartitions(parts.length)
  parts
}

This means fromHivePartition can't correct the partition location:

locationUri = Option(CatalogUtils.stringToURI(apiPartition.getSd.getLocation)),
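A hedged sketch of the eventual fix (names follow the snippets later in this thread; the toAbsoluteURI helper is the one sketched earlier): thread the qualified URI into fromHivePartition and apply it to each partition location:

// Inside fromHivePartition, with absoluteUri: Option[URI] passed in:
locationUri = Option(apiPartition.getSd.getLocation).map { loc =>
  toAbsoluteURI(CatalogUtils.stringToURI(loc), absoluteUri)
},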

@cloud-fan (Contributor):
for tables created by recent versions of Spark, this condition is always true?

@wangyum (Member Author):
Yes.

@wangyum (Member Author):
[screenshot]

And the other commands:
spark-sql>
         > CREATE TABLE DB_SPARK_39203.t7(id int) using parquet;
Time taken: 0.43 seconds
spark-sql>
         > CREATE TABLE DB_SPARK_39203.t8(id int)
         > stored as parquet
         > LOCATION '/user/hive/warehouse/db_spark_39203.db/t1';
Time taken: 0.265 seconds
spark-sql>
         > CREATE EXTERNAL TABLE DB_SPARK_39203.t9(id int)
         > stored as parquet
         > LOCATION '/user/hive/warehouse/db_spark_39203.db/t1';
Time taken: 0.278 seconds
spark-sql> desc formatted DB_SPARK_39203.t7;
id	int	NULL

# Detailed Table Information
Database	db_spark_39203
Table	t7
Owner	####
Created Time	Fri May 27 00:31:09 GMT-07:00 2022
Last Access	UNKNOWN
Created By	Spark 3.2.0-SNAPSHOT
Type	MANAGED
Provider	parquet
Location	viewfs://####/user/hive/warehouse/db_spark_39203.db/t7
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Time taken: 0.152 seconds, Fetched 15 row(s)
spark-sql> desc formatted DB_SPARK_39203.t8;
id	int	NULL

# Detailed Table Information
Database	db_spark_39203
Table	t8
Owner	####
Created Time	Fri May 27 00:31:09 GMT-07:00 2022
Last Access	UNKNOWN
Created By	Spark 3.2.0-SNAPSHOT
Type	EXTERNAL
Provider	hive
Table Properties	[transient_lastDdlTime=1653636669]
Location	viewfs://####/user/hive/warehouse/db_spark_39203.db/t1
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties	[serialization.format=1]
Partition Provider	Catalog
Time taken: 0.224 seconds, Fetched 18 row(s)
spark-sql> desc formatted DB_SPARK_39203.t9;
id	int	NULL

# Detailed Table Information
Database	db_spark_39203
Table	t9
Owner	####
Created Time	Fri May 27 00:31:10 GMT-07:00 2022
Last Access	UNKNOWN
Created By	Spark 3.2.0-SNAPSHOT
Type	EXTERNAL
Provider	hive
Table Properties	[transient_lastDdlTime=1653636670]
Location	viewfs://####/user/hive/warehouse/db_spark_39203.db/t1
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties	[serialization.format=1]
Partition Provider	Catalog
Time taken: 0.142 seconds, Fetched 18 row(s)

@cloud-fan (Contributor):
Shouldn't this be CatalogUtils.stringToURI(client.getDatabase(h.getDbName).getLocationUri)?

@wangyum (Member Author):
+1

@cloud-fan (Contributor):
nit:

if (!uri.isAbsolute && parentUri.isDefined) {
  new URI...
} else {
  uri
}

@cloud-fan (Contributor):
I think the name parentUri is a bit misleading. It's more like an absoluteUri from which we need to inherit the scheme, host, port, etc.

@wangyum (Member Author):
Renamed it to absoluteUri.

Test failure from calling Table.getDataLocation directly (its return type differs across Hive versions), which is why the call goes through the version shim:

[info] - 0.12: getPartitionNames(catalogTable) (74 milliseconds)
[info] org.apache.spark.sql.hive.client.HiveClientSuites *** ABORTED *** (13 seconds, 434 milliseconds)
[info]   java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.metadata.Table.getDataLocation()Lorg/apache/hadoop/fs/Path;
[info]   at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitions$3(HiveClientImpl.scala:765)

Diff context under review:

  storage = CatalogStorageFormat(
-   locationUri = shim.getDataLocation(h).map(CatalogUtils.stringToURI),
+   locationUri = shim.getDataLocation(h).map { loc =>
+     val tableUri = stringToURI(loc)
@cloud-fan (Contributor):
can we add some code comments to explain this backward compatibility story?

@wangyum (Member Author):
OK

val absoluteUri = shim.getDataLocation(hiveTable).map(stringToURI).filter(!_.isAbsolute)
  .map(_ => stringToURI(client.getDatabase(hiveTable.getDbName).getLocationUri))
val parts = shim.getPartitions(client, hiveTable, partSpec.asJava)
  .map(fromHivePartition(_, absoluteUri))
@cloud-fan (Contributor):
does this mean we always calculate the absoluteUri even if the partition uri is absolute?

@wangyum (Member Author):
No. absoluteUri is None if the table location is an absolute URI.
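For illustration (hypothetical values):

// table location viewfs://clusterA/user/hive/warehouse/db.db/t1 (absolute)
//   -> filter(!_.isAbsolute) drops it -> absoluteUri = None
// table location /user/hive/warehouse/db.db/t1 (relative)
//   -> kept, then mapped to the database URI
//   -> absoluteUri = Some(viewfs://clusterA/user/hive/warehouse/db.db)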

val storageWithLocation = {
-  val tableLocation = getLocationFromStorageProps(table)
+  val tableLocation = getLocationFromStorageProps(table).map { path =>
+    toAbsoluteURI(CatalogUtils.stringToURI(path), table.storage.locationUri)
@cloud-fan (Contributor) commented May 30, 2022:
ditto, let's explain the backward compatibility story with code comments.

@wangyum (Member Author):
OK

@wangyum wangyum changed the title [SPARK-39203][SQL] Rewrite table location to absolute location based on database location [SPARK-39203][SQL] Rewrite table location to absolute URI based on database URI May 31, 2022
@wangyum wangyum closed this in f969b88 May 31, 2022
@wangyum (Member Author) commented May 31, 2022:

Merged to master.

@wangyum wangyum deleted the SPARK-39203 branch May 31, 2022 10:11
dongjoon-hyun pushed a commit that referenced this pull request Oct 20, 2022
### What changes were proposed in this pull request?

This fixes a corner-case regression caused by #36625. Users may have existing views that have invalid locations due to historical reasons. The location is actually useless for a view, but after #36625 , they start to fail to read the view as qualifying the location fails. We should just skip qualifying view locations.

### Why are the changes needed?

avoid regression

### Does this PR introduce _any_ user-facing change?

Spark can read view with invalid location again.

### How was this patch tested?

Manually tested. A view with an invalid location is kind of "broken" and can't be dropped (HMS fails to drop it), so we can't write a unit test for it.

Closes #38321 from cloud-fan/follow.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022 (same follow-up commit as above; closes apache#38321).
HyukjinKwon pushed a commit that referenced this pull request Apr 21, 2023
…d on database URI

### What changes were proposed in this pull request?

This reverts #36625 and its followup #38321 .

### Why are the changes needed?

External table location can be arbitrary and has no connection with the database location. It can be wrong to qualify the external table location based on the database location.

If a table written by old Spark versions does not have a qualified location, there is no way to restore it, as the information is already lost. Users can manually fix the table locations themselves, assuming the tables are on the same HDFS cluster as the database location.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #40871 from cloud-fan/minor.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Apr 21, 2023 (same revert as above; cherry picked from commit afd9e2c).
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023 (same revert; cherry picked from commit afd9e2c).
GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023 (same revert; cherry picked from commit afd9e2c).
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023 (same revert; cherry picked from commit afd9e2c).