
Conversation

@CTTY
Contributor

@CTTY CTTY commented Jul 18, 2023

Change Logs

  • Create a new HiveConf object every time HiveSyncConfig is initialized, and add hadoopConf as a resource.
    We have to load the Hadoop conf, otherwise properties like --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory can't be passed in via a Spark Hudi job. This regression was introduced by another PR that changed this code to address an OOM issue: [HUDI-5855] Release resource actively for Flink hive meta sync #8050

To still address the OOM problem, we create a new HiveConf every time instead of casting the Hadoop conf to a HiveConf, so we don't keep references to the old Hadoop conf.
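A minimal sketch of the before/after, assembled from the snippets quoted later in this thread (not the exact patch):

// Before (#8050): reuse or cast the job's Hadoop conf, keeping a long-lived
// reference to it.
HiveConf hiveConf = hadoopConf instanceof HiveConf
        ? (HiveConf) hadoopConf : new HiveConf(hadoopConf, HiveConf.class);

// After (this PR): always create a fresh HiveConf, then load the Hadoop conf
// as a resource so Spark-provided properties (e.g. the Glue client factory
// class above) are still picked up.
HiveConf hiveConf = new HiveConf();
hiveConf.addResource(hadoopConf);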

Impact

None

Risk level (write none, low, medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@CTTY
Contributor Author

CTTY commented Jul 18, 2023

Hi @xushiyan, I noticed the casting from hadoopConf to hiveConf was introduced by a PR from you (#6202), but I couldn't find any context. Could you help me understand why we made that change?

HiveConf hiveConf = hadoopConf instanceof HiveConf
        ? (HiveConf) hadoopConf : new HiveConf(hadoopConf, HiveConf.class);

@umehrot2 umehrot2 requested review from xushiyan and yihua July 19, 2023 18:43
@yihua yihua added the priority:blocker (Production down; release blocker) and release-0.14.0 labels Jul 19, 2023
@yihua yihua self-assigned this Jul 19, 2023
@xushiyan
Member

Hi @xushiyan, I noticed the casting from hadoopConf to hiveConf was introduced by a PR from you (#6202), but I couldn't find any context. Could you help me understand why we made that change?

HiveConf hiveConf = hadoopConf instanceof HiveConf
        ? (HiveConf) hadoopConf : new HiveConf(hadoopConf, HiveConf.class);

hey @CTTY it was probably meant to stay fully compatible with the original code, since that change was part of a refactoring.

- ? (HiveConf) hadoopConf : new HiveConf(hadoopConf, HiveConf.class);
+ HiveConf hiveConf = new HiveConf();
+ // HiveConf needs to load Hadoop conf to allow instantiation via AWSGlueClientFactory
+ hiveConf.addResource(hadoopConf);
Member

not so sure if this is equivalent to holding the original hadoopConf, as it changes the order of addResource() calls during construction. We should be good only if we can verify the equivalence.
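For context, Hadoop's Configuration resolves non-final properties in resource order, with later resources overriding earlier ones. A minimal standalone sketch (key and values hypothetical):

import org.apache.hadoop.conf.Configuration;

public class ResourceOrderDemo {
    public static void main(String[] args) {
        Configuration first = new Configuration(false);
        first.set("demo.key", "from-first");

        Configuration second = new Configuration(false);
        second.set("demo.key", "from-second");

        // Resources are applied in the order they are added; the later one
        // wins for any property not marked final.
        Configuration merged = new Configuration(false);
        merged.addResource(first);
        merged.addResource(second);

        System.out.println(merged.get("demo.key")); // prints "from-second"
    }
}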

Member

i think the ideal approach is to make the passed-in hiveConf load the hadoop conf properly, so AWSGlueClientFactory can be used from the very beginning (when creating the hive sync config) and nothing needs to be loaded at this point. cc @yihua
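A hypothetical sketch of that alternative, assuming HiveSyncConfig accepts a Hadoop Configuration and that the engine exposes its conf as jobHadoopConf (both names illustrative, not the actual patch):

// Pre-load the Hadoop conf into the HiveConf when the sync config is
// created, so nothing needs to be loaded later at sync time.
HiveConf hiveConf = new HiveConf();
hiveConf.addResource(jobHadoopConf); // jobHadoopConf: the engine's Hadoop configuration (assumed)
HiveSyncConfig syncConfig = new HiveSyncConfig(props, hiveConf); // props: sync properties; constructor shape assumed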

Contributor Author

Loading the AWSGlueClientFactory property specifically would solve the issue on the AWS side, but other or custom configs may be passed in via the Spark session, and those wouldn't be picked up by hard-coded logic. I still think loading the entire hadoop conf here is the safer choice.

This change does alter the order in which resources are added to the hive conf. I've gone through the HiveConf constructor but didn't see any usage of the resources during construction, so I think it shouldn't matter; maybe I've overlooked something?
An alternative would be to always pass hadoopConf to the HiveConf constructor. Wdyt?

Suggested change:
- hiveConf.addResource(hadoopConf);
+ HiveConf hiveConf = new HiveConf(hadoopConf, HiveConf.class);

Contributor

but other or custom configs may be passed in via the Spark session

Is this a typical way for people to pass hive options around with spark?

An alternative would be to always pass hadoopConf to the HiveConf constructor

Does it introduce too much overhead then?

Contributor Author

Hi @danny0405,

  1. It's not the preferred way, but it might be the best option in cases like serverless applications, since those can't easily modify hive-site.xml
  2. There is some overhead in loading the hadoop conf into the hive conf instead of casting it directly, but it's almost negligible imo

Member

@CTTY always creating a new HiveConf looks good to me.

@xushiyan
Member

@CTTY will you be able to test and verify the change before we land it? It's a blocker for 0.14.0.

@CTTY
Contributor Author

CTTY commented Aug 30, 2023

Hey @xushiyan, I've tested this fix manually with EMR Serverless and it works fine.

@CTTY CTTY force-pushed the ctty/hiveconf-fix branch from d61cae5 to 68f6b69 Compare September 2, 2023 04:38
@CTTY
Contributor Author

CTTY commented Sep 2, 2023

Hi @xushiyan, regarding the Azure CI failure: I was able to reproduce it locally but haven't had a chance to root-cause it.

Issue:

org.apache.hudi.hive.HoodieHiveSyncException: Failed to get the last replicated time from the table bar
	at org.apache.hudi.hive.HoodieHiveSyncClient.getLastReplicatedTime(HoodieHiveSyncClient.java:305)
	at org.apache.hudi.hive.replication.GlobalHiveSyncTool.getLastReplicatedTimeStampMap(GlobalHiveSyncTool.java:64)
	at org.apache.hudi.hive.replication.ReplicationStateSync.<init>(ReplicationStateSync.java:37)
	at org.apache.hudi.hive.replication.HiveSyncGlobalCommitTool.getReplicatedState(HiveSyncGlobalCommitTool.java:51)
	at org.apache.hudi.hive.replication.HiveSyncGlobalCommitTool.<init>(HiveSyncGlobalCommitTool.java:100)
	at org.apache.hudi.hive.replication.TestHiveSyncGlobalCommitTool.testHiveConfigShouldMatchClusterConf(TestHiveSyncGlobalCommitTool.java:113)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

The issue above only shows up when using new HiveConf(hadoopConf, HiveConf.class). When it's reverted to HiveConf hiveConf = new HiveConf(); hiveConf.addResource(hadoopConf); it works fine.

I think we should proceed with the first version of the change if the order of loading the hadoop conf doesn't break any Hudi components. Please let me know what you think :)

@xushiyan
Member

xushiyan commented Sep 4, 2023

The issue above only shows up when using new HiveConf(hadoopConf, HiveConf.class). When it's reverted to HiveConf hiveConf = new HiveConf(); hiveConf.addResource(hadoopConf); it works fine.

@CTTY ok, the order of loading resources makes a difference: the 1st approach adds hadoopConf first and then overrides it with whatever comes from the hive-specific confs, while the 2nd approach does the opposite. In our case we want to respect the user-provided hadoopConf, so we should add it as a resource after HiveConf initializes. I'll rebase on master, as CI didn't seem to run for your last push.
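Put concretely, a sketch of the two initializations side by side (comments reflect the precedence described above):

// 1st approach: hadoopConf is loaded first, then HiveConf layers its own
// hive-specific confs on top, which can override the user's values.
HiveConf viaCtor = new HiveConf(hadoopConf, HiveConf.class);

// 2nd approach: HiveConf initializes with its own resources first, then
// hadoopConf is added last, so user-provided Hadoop properties win.
HiveConf viaResource = new HiveConf();
viaResource.addResource(hadoopConf);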

@hudi-bot
Collaborator

hudi-bot commented Sep 4, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@xushiyan
Member

xushiyan commented Sep 6, 2023

@danny0405 do you have other input?

@danny0405
Contributor

@danny0405 do you have other input?

Fine with it. It should not have the resource reference leak like before, right?

@CTTY
Contributor Author

CTTY commented Sep 12, 2023

Fine with it. It should not have the resource reference leak like before, right?

@danny0405
No, it won't have the OOM issue, since it creates a new HiveConf instance every time it's invoked instead of casting from hadoopConf.

Contributor

@yihua yihua left a comment

LGTM. will merge this now.

@yihua yihua merged commit 32a8b85 into apache:master Sep 14, 2023
prashantwason pushed a commit that referenced this pull request Sep 19, 2023
This commit fixes the Hive sync config by creating a new HiveConf object every time HiveSyncConfig is initialized and adding hadoopConf as a resource. We have to load the Hadoop conf, otherwise properties like `--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory` can't be passed in via a Spark Hudi job.

Co-authored-by: Shawn Chang <[email protected]>
@CTTY CTTY deleted the ctty/hiveconf-fix branch September 21, 2023 17:23
TheR1sing3un pushed a commit to TheR1sing3un/hudi that referenced this pull request Feb 12, 2025
This commit fixes the Hive sync config by creating a new HiveConf object every time HiveSyncConfig is initialized and adding hadoopConf as a resource. We have to load the Hadoop conf, otherwise properties like `--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory` can't be passed in via a Spark Hudi job.

Co-authored-by: Shawn Chang <[email protected]>

Labels

priority:blocker (Production down; release blocker), release-0.14.1

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

6 participants