[SPARK-19664][SQL]put hive.metastore.warehouse.dir in hadoopconf to overwrite its original value #16996
Conversation
…verwrite its original value
|
Test build #73144 has finished for PR 16996 at commit
|
|
Test build #73152 has started for PR 16996 at commit |
|
retest this please |
|
Test build #73155 has finished for PR 16996 at commit
|
|
cc @cloud-fan |
// When neither spark.sql.warehouse.dir nor hive.metastore.warehouse.dir is set,
// we will set hive.metastore.warehouse.dir to the default value of spark.sql.warehouse.dir.
val sparkWarehouseDir = sparkContext.conf.get(WAREHOUSE_PATH)
logInfo(s"${WAREHOUSE_PATH.key} is set, Setting " +
Please remove "${WAREHOUSE_PATH.key} is set,
|
In your PR title, |
|
Why do we need to overwrite its original value? I thought we would not use that value any more. |
|
this is a minor improvement: it is better to keep the same value between sparkConf and hadoopConf, and to add some log info for user debugging. |
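The change being discussed can be sketched as follows (a simplified, hypothetical fragment in the style of SharedState, not the exact patch; it assumes `sparkContext` and `WAREHOUSE_PATH` are in scope):

```scala
// Hypothetical sketch: keep hive.metastore.warehouse.dir consistent
// across sparkConf and hadoopConf, and log the change for debugging.
val sparkWarehouseDir = sparkContext.conf.get(WAREHOUSE_PATH)
logInfo(s"Setting hive.metastore.warehouse.dir to '$sparkWarehouseDir'")
// Overwrite the Hadoop-side value so both configurations agree.
sparkContext.hadoopConfiguration.set("hive.metastore.warehouse.dir", sparkWarehouseDir)
```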
|
cc @yhuai who did the original change. I am not sure whether we need to overwrite the original value of hadoopConf, although the change does not hurt anything IMO. |
|
Test build #73198 has finished for PR 16996 at commit
|
|
@yhuai could you help to review this? thanks~ |
val warehousePath = {
  val configFile = Utils.getContextOrSparkClassLoader.getResource("hive-site.xml")
  if (configFile != null) {
    logInfo(s"load config from hive-site.xml $configFile")
nit: loading hive config file: $configFile
// When neither spark.sql.warehouse.dir nor hive.metastore.warehouse.dir is set,
// we will set hive.metastore.warehouse.dir to the default value of spark.sql.warehouse.dir.
val sparkWarehouseDir = sparkContext.conf.get(WAREHOUSE_PATH)
logInfo(s"Setting hive.metastore.warehouse.dir ($hiveWarehouseDir) to the value of " +
nit: hive.metastore.warehouse.dir is not set, setting it to the value of ${WAREHOUSE_PATH.key}
when we hit this condition because we provided a WAREHOUSE_PATH, hive.metastore.warehouse.dir may also have been set
($hiveWarehouseDir) -> ('$hiveWarehouseDir')
if (hiveWarehouseDir != null) {
  sparkContext.hadoopConfiguration.set("hive.metastore.warehouse.dir", sparkWarehouseDir)
}
sparkContext.conf.set("hive.metastore.warehouse.dir", sparkWarehouseDir)
do we still need to do this?
I removed hive.metastore.warehouse.dir from sparkConf in the commit above, but some tests failed, so I reverted this logic; let me dig into it more.
I looked into the code and ran some tests.
The `hive.metastore.warehouse.dir` in sparkConf still takes effect in Spark; it is not useless.
The reason is that:
1. When we run Spark with Hive enabled, it creates a `SharedState`.
2. When creating `SharedState`, it creates a `HiveExternalCatalog`.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L85
3. When creating `HiveExternalCatalog`, it creates a `HiveClientImpl` via `HiveUtils`.
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L65
4. When creating `HiveClientImpl`, it calls `SessionState.start(state)`.
5. In `SessionState.start(state)`, Hive creates a default database using the `hive.metastore.warehouse.dir` in the `hiveConf` that is created in `HiveClientImpl`.
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L189
6. The `hiveConf` created in `HiveClientImpl` is built from `hadoopConf` and `sparkConf`, and `sparkConf` overwrites the value of any key it shares with `hadoopConf`. So it actually uses the `hive.metastore.warehouse.dir` from `sparkConf` to create the default database; if we do not overwrite the value in `sparkConf` in SharedState, the database location is not the warehouse path we expect. So here `sparkContext.conf.set("hive.metastore.warehouse.dir", sparkWarehouseDir)` should be retained.

We can also see that the default database is not created in `SharedState`: the condition there is false, so the create-database logic is not hit; the database has already been created when we initialized the `HiveClientImpl`.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L96
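The overwrite order described above can be illustrated with plain Scala maps (a standalone sketch, not Spark's actual code; the paths are made-up values):

```scala
// Sketch of how HiveClientImpl layers its configuration: entries from
// sparkConf are applied after hadoopConf, so sparkConf wins on shared keys.
val hadoopConf = Map("hive.metastore.warehouse.dir" -> "/user/hive/warehouse")
val sparkConf  = Map("hive.metastore.warehouse.dir" -> "/user/spark/warehouse")

val hiveConf = hadoopConf ++ sparkConf // the later map overwrites shared keys

// hiveConf("hive.metastore.warehouse.dir") is now "/user/spark/warehouse",
// which is why a stale value left in sparkConf would mislocate the default database.
```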
logInfo(s"Setting hive.metastore.warehouse.dir ($hiveWarehouseDir) to the value of " +
  s"${WAREHOUSE_PATH.key} ('$sparkWarehouseDir').")
if (hiveWarehouseDir != null) {
  sparkContext.hadoopConfiguration.set("hive.metastore.warehouse.dir", sparkWarehouseDir)
we can always set it.
yes, maybe if we always set it, the sparkContext.conf.set("hive.metastore.warehouse.dir", sparkWarehouseDir) could be removed; let me test it more, thanks~
|
Test build #73340 has finished for PR 16996 at commit
|
|
Test build #73343 has finished for PR 16996 at commit
|
logInfo(s"Setting hive.metastore.warehouse.dir ($hiveWarehouseDir) to the value of " +
  s"${WAREHOUSE_PATH.key} ('$sparkWarehouseDir').")
sparkContext.hadoopConfiguration.set("hive.metastore.warehouse.dir", sparkWarehouseDir)
sparkContext.conf.set("hive.metastore.warehouse.dir", sparkWarehouseDir)
how about sparkContext.conf.remove("hive.metastore.warehouse.dir")? it's good to have a single source of truth.
yes, it is ok to remove it, and it will be clearer that hive.metastore.warehouse.dir and WAREHOUSE_PATH are each in their right place.
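With the suggestion above applied, the final logic would look roughly like this (a hedged sketch of the direction agreed on in this thread, not the exact merged code; assumes `sparkContext`, `hiveWarehouseDir`, and `WAREHOUSE_PATH` are in scope):

```scala
// spark.sql.warehouse.dir becomes the single source of truth.
val sparkWarehouseDir = sparkContext.conf.get(WAREHOUSE_PATH)
logInfo(s"Setting hive.metastore.warehouse.dir ('$hiveWarehouseDir') to the value of " +
  s"${WAREHOUSE_PATH.key} ('$sparkWarehouseDir').")
// Always propagate the Spark value to hadoopConf so Hive sees it...
sparkContext.hadoopConfiguration.set("hive.metastore.warehouse.dir", sparkWarehouseDir)
// ...and drop the duplicate from sparkConf so only one copy remains.
sparkContext.conf.remove("hive.metastore.warehouse.dir")
```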
|
LGTM |
|
Test build #73389 has finished for PR 16996 at commit
|
|
thanks, merging to master! |
…overwrite its original value

## What changes were proposed in this pull request?

In [SPARK-15959](https://issues.apache.org/jira/browse/SPARK-15959), we brought back `hive.metastore.warehouse.dir`. In that logic, when the value of `spark.sql.warehouse.dir` is used to overwrite `hive.metastore.warehouse.dir`, it is set in `sparkContext.conf`, which does not overwrite the value in hadoopConf. I think it should be put in `sparkContext.hadoopConfiguration` to overwrite the original hadoopConf value.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64

## How was this patch tested?

N/A

Author: windpiger <[email protected]>

Closes apache#16996 from windpiger/hivemetawarehouseConf.