[HUDI-6550] Add Hadoop conf to HiveConf for HiveSyncConfig #9221
Conversation
Hi @xushiyan, I noticed the casting from hadoopConf to hiveConf was introduced by your PR (#6202), but I couldn't find any context. Could you help me understand why we made that change?

hey @CTTY it's probably meant for staying fully compatible with the original code, as it was done as part of a refactoring.
Before:

```java
? (HiveConf) hadoopConf : new HiveConf(hadoopConf, HiveConf.class);
```

After:

```java
HiveConf hiveConf = new HiveConf();
// HiveConf needs to load Hadoop conf to allow instantiation via AWSGlueClientFactory
hiveConf.addResource(hadoopConf);
```
not so sure if this is equivalent to holding the original hadoopConf, as this changes the order of addResource() calls during construction. We should be good only if we can verify the equivalence.
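To make the ordering concern concrete, here is a minimal, self-contained sketch (the property name `demo.key` and the class name are made up for illustration) of how Hadoop's `Configuration` resolves resources: for non-final properties, resources added later override earlier ones, so reordering `addResource()` calls can change effective values.

```java
import org.apache.hadoop.conf.Configuration;

public class ResourceOrderDemo {
  public static void main(String[] args) {
    Configuration first = new Configuration(false);
    first.set("demo.key", "from-first");

    Configuration second = new Configuration(false);
    second.set("demo.key", "from-second");

    // Layer both confs into a third one; the order of addResource()
    // determines which value wins for non-final properties.
    Configuration merged = new Configuration(false);
    merged.addResource(first);
    merged.addResource(second);

    System.out.println(merged.get("demo.key")); // prints "from-second"
  }
}
```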
i think the ideal approach is to make the passed-in hiveConf load the hadoop conf properly for using AWSGlueClientFactory at the very beginning (when creating the hive sync config), so that nothing needs to be loaded at this point. cc @yihua
Loading the AWSGlueClientFactory property specifically would solve the issue on the AWS side, but it's possible that there are other configs/custom configs passed in via the Spark session that won't be loaded by hard-coded logic. I still think loading the entire hadoop conf here is the safer choice.

This change does alter the order of adding resources to the hive conf. I've gone through the HiveConf constructor but didn't see any usage of resources during construction, so I think it shouldn't have an effect; maybe I've overlooked something?

An alternative solution would be to always pass hadoopConf to the HiveConf constructor, as in the suggestion below. Wdyt?
Suggested change:

```diff
- hiveConf.addResource(hadoopConf);
+ HiveConf hiveConf = new HiveConf(hadoopConf, HiveConf.class);
```
> but it's possible that there are other configs/custom configs passed in via Spark session

Is this a classical way people pass around hive options with spark?

> An alternative solution would be always pass hadoopConf to HiveConf constructor

Does it introduce too much overhead then?
Hi @danny0405,

- It's not the preferred way, but it might be the best way in cases like serverless applications, since those can't modify `hive-site.xml` easily
- There is overhead to loading the hadoop conf into the hive conf instead of casting it directly, but it's almost negligible imo
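For context on passing hive options through Spark, here is a minimal sketch (the app name and local master are made up for illustration; the factory class is the one from this PR's description): any setting prefixed with `spark.hadoop.` is copied, with the prefix stripped, into the Hadoop `Configuration` that Spark hands to components like Hudi's Hive sync.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class SparkHadoopConfDemo {
  public static void main(String[] args) {
    // "spark.hadoop.*" settings flow into the application's Hadoop conf.
    SparkSession spark = SparkSession.builder()
        .master("local[1]") // local master so the sketch runs standalone
        .appName("spark-hadoop-conf-demo")
        .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
        .getOrCreate();

    Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
    // Prints the Glue factory class, showing the property reached the
    // conf that is eventually passed down to Hudi's Hive sync.
    System.out.println(hadoopConf.get("hive.metastore.client.factory.class"));
    spark.stop();
  }
}
```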
@CTTY always creating a new HiveConf looks good to me.
@CTTY will you be able to test and verify the change before we land it? it's a blocker for 0.14.0

Hey @xushiyan, I've tested this fix manually with EMR Serverless and it works fine
Hi @xushiyan, regarding the Azure CI failure: I was able to reproduce it locally but haven't had a chance to root-cause it yet. The issue above only shows up when using [...]. I think we should proceed with the first version of the change if the order of loading the hadoop conf doesn't break any Hudi components. Please let me know what you think :)

@CTTY ok, the order of loading resources makes a difference: the 1st approach first adds [...]
@danny0405 do you have any other input?

Fine with it; it should not have the resource reference leak like before, right?

@danny0405 [...]
yihua left a comment
LGTM. will merge this now.
This commit fixes the Hive sync config by creating a new HiveConf object every time HiveSyncConfig is initialized and adding hadoopConf as a resource. We have to load the Hadoop conf, otherwise properties like `--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory` can't be passed in via a Spark Hudi job. Co-authored-by: Shawn Chang <[email protected]>
Change Logs
Create a new HiveConf object every time when initializing `HiveSyncConfig` and add hadoopConf as a resource.

We have to load the Hadoop conf, otherwise properties like `--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory` can't be passed in via a Spark Hudi job. This issue was introduced when another PR changed this code to address an OOM issue: [HUDI-5855] Release resource actively for Flink hive meta sync #8050. To keep addressing the OOM problem, we create a new HiveConf every time instead of casting the hadoop conf to a hive conf, so we don't keep references to the old hadoop conf (see the sketch below).
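A minimal sketch of the resulting behavior (the class and method names here are illustrative, not the actual ones in the codebase):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.conf.HiveConf;

public final class HiveConfSketch {

  // Always build a fresh HiveConf and layer the job's Hadoop conf on top,
  // instead of down-casting hadoopConf when it happens to be a HiveConf.
  static HiveConf buildHiveConf(Configuration hadoopConf) {
    HiveConf hiveConf = new HiveConf();
    // Adding hadoopConf as a resource makes Spark-provided settings
    // (e.g. the Glue client factory class) visible to Hive sync, without
    // keeping a long-lived reference to the caller's HiveConf instance.
    hiveConf.addResource(hadoopConf);
    return hiveConf;
  }
}
```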
Impact
None
Risk level (write none, low, medium or high below)
None
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist