[HUDI-3112] Fix KafkaConnect can not sync to Hive Problem #4458

Merged
yihua merged 1 commit into apache:master from cdmikechen:HUDI-3112
Jan 9, 2022

@cdmikechen cdmikechen commented Dec 28, 2021

What is the purpose of the pull request

KafkaConnect currently uses org.apache.hudi.DataSourceUtils to build HiveSyncConfig, but DataSourceUtils imports some Spark dependencies. Those Spark classes are not on the Kafka Connect classpath, so Hive sync fails when they are loaded:

Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.types.DataType
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 66 more
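
For readers unfamiliar with this failure mode, here is a minimal sketch of how to check whether a class is visible at runtime. `ClasspathCheck` and `isClassAvailable` are hypothetical names used only for illustration; they are not part of Hudi or of this PR.

```java
/** Hypothetical helper (not part of Hudi): reports whether a class can be loaded. */
final class ClasspathCheck {

    /** Returns true if the named class is visible to this class's class loader. */
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class named in the stack trace above
        String sparkType = "org.apache.spark.sql.types.DataType";
        System.out.println(sparkType + " available: " + isClassAvailable(sparkType));
    }
}
```

In a runtime without the Spark jars, loading any class that references Spark types can end in the ClassNotFoundException shown above, even if the Spark-specific code paths are never meant to be used.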

Brief change log

  • Add a separate method to create HiveSyncConfig
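
A rough sketch of the idea, not the code in this PR: build the sync settings from plain properties so that no Spark class is touched. `SimpleHiveSyncConfig` and `SyncConfigSketch` are hypothetical stand-ins for the real HiveSyncConfig and its builder, and the keys and defaults are borrowed from the Flink options quoted later in this conversation purely for illustration.

```java
import java.util.Properties;

/** Hypothetical stand-in for a Hive sync config; NOT the real Hudi HiveSyncConfig. */
final class SimpleHiveSyncConfig {
    String basePath;
    String jdbcUrl;
    String username;
    String password;
}

/** Hypothetical builder that depends only on java.util, so no Spark class is loaded. */
final class SyncConfigSketch {
    // Illustrative keys and defaults (assumptions, taken from the Flink options in this thread)
    static final String JDBC_URL_KEY = "hive_sync.jdbc_url";
    static final String USERNAME_KEY = "hive_sync.username";
    static final String PASSWORD_KEY = "hive_sync.password";

    static SimpleHiveSyncConfig buildSyncConfig(Properties props, String tableBasePath) {
        SimpleHiveSyncConfig config = new SimpleHiveSyncConfig();
        config.basePath = tableBasePath;
        // Fall back to the default when the property is not set
        config.jdbcUrl = props.getProperty(JDBC_URL_KEY, "jdbc:hive2://localhost:10000");
        config.username = props.getProperty(USERNAME_KEY, "hive");
        config.password = props.getProperty(PASSWORD_KEY, "hive");
        return config;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(USERNAME_KEY, "alice");
        SimpleHiveSyncConfig config = buildSyncConfig(props, "/tmp/hudi_table");
        System.out.println(config.username + " @ " + config.jdbcUrl);
    }
}
```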

Verify this pull request

A Hive sync test still needs to be added; that work is tracked in https://issues.apache.org/jira/browse/HUDI-2673

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@cdmikechen
Contributor Author

@yihua Thanks for reviewing this

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Jan 3, 2022
/**
 * Build Hive Sync Config
 */
public HiveSyncConfig buildSyncConfig(TypedProperties props, String tableBasePath) {
Contributor

If the problem is caused by DataSourceUtils importing classes that are irrelevant to Kafka Connect, would it be possible to move DataSourceUtils::buildHiveSyncConfig to a different or new util class, so that buildHiveSyncConfig() can still be reused here instead of duplicating the code in the hudi-kafka-connect module? Would that solve the problem?

Contributor Author

@yihua
I checked the Hudi project when I modified the code. Besides Hive sync in Spark, Flink has the same problem. However, the Flink module works around it by redeclaring its own set of config options:

public static final ConfigOption<String> HIVE_SYNC_MODE = ConfigOptions
    .key("hive_sync.mode")
    .stringType()
    .defaultValue("jdbc")
    .withDescription("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'jdbc'");

public static final ConfigOption<String> HIVE_SYNC_USERNAME = ConfigOptions
    .key("hive_sync.username")
    .stringType()
    .defaultValue("hive")
    .withDescription("Username for hive sync, default 'hive'");

public static final ConfigOption<String> HIVE_SYNC_PASSWORD = ConfigOptions
    .key("hive_sync.password")
    .stringType()
    .defaultValue("hive")
    .withDescription("Password for hive sync, default 'hive'");

public static final ConfigOption<String> HIVE_SYNC_JDBC_URL = ConfigOptions
    .key("hive_sync.jdbc_url")
    .stringType()
    .defaultValue("jdbc:hive2://localhost:10000")
    .withDescription("Jdbc URL for hive sync, default 'jdbc:hive2://localhost:10000'");

public static final ConfigOption<String> HIVE_SYNC_METASTORE_URIS = ConfigOptions
    .key("hive_sync.metastore.uris")
    .stringType()
    .defaultValue("")
    .withDescription("Metastore uris for hive sync, default ''");

public static final ConfigOption<String> HIVE_SYNC_PARTITION_FIELDS = ConfigOptions
    .key("hive_sync.partition_fields")
    .stringType()
    .defaultValue("")
    .withDescription("Partition fields for hive sync, default ''");

Since unifying these configs would be a relatively large change, it may be better to handle it in a separate issue. Hive sync also contains some Scala logic, so it cannot be split out directly.

@yihua yihua Jan 9, 2022

@cdmikechen Understood. I'm thinking about moving only the util methods related to Hive sync configs, not the Hive sync logic, to a separate util class. My worry is that Hive sync configs are now spread across different places, and they may diverge if we forget to update all of them consistently.

We can keep this PR as is for now. @cdmikechen could you create a Jira ticket to track the Hive sync config unification, which will be done in a different PR in the future?
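
To make the divergence concern concrete, here is one possible shape for a shared definition. It is only a sketch: `HiveSyncOption` and the `connect.hive_sync.` prefix are hypothetical and do not reflect Hudi's actual design. The point is that each option's default is declared once, and each engine only supplies its own key prefix.

```java
import java.util.Properties;

/** Hypothetical single source of truth for Hive sync options (not a Hudi class). */
enum HiveSyncOption {
    USERNAME("username", "hive"),
    PASSWORD("password", "hive"),
    JDBC_URL("jdbc_url", "jdbc:hive2://localhost:10000");

    private final String suffix;
    private final String defaultValue;

    HiveSyncOption(String suffix, String defaultValue) {
        this.suffix = suffix;
        this.defaultValue = defaultValue;
    }

    /** Engine-specific key, e.g. prefix "hive_sync." yields "hive_sync.username". */
    String key(String prefix) {
        return prefix + suffix;
    }

    /** Reads the value under the engine's prefix, falling back to the shared default. */
    String resolve(Properties props, String prefix) {
        return props.getProperty(key(prefix), defaultValue);
    }
}

final class HiveSyncOptionDemo {
    public static void main(String[] args) {
        Properties empty = new Properties();
        // Two engines with different (assumed) prefixes still share one default
        System.out.println(HiveSyncOption.USERNAME.resolve(empty, "hive_sync."));
        System.out.println(HiveSyncOption.USERNAME.resolve(empty, "connect.hive_sync."));
    }
}
```

With this layout, changing a default touches one line, so the Spark, Flink and Kafka Connect paths cannot silently drift apart.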

@nsivabalan
Contributor

@cdmikechen @yihua: We are targeting this patch for 0.10.1, and we have a code freeze planned for this Monday. It would be nice to get this in by then; this is just a reminder.

/**
 * Build Hive Sync Config
 */
public HiveSyncConfig buildSyncConfig(TypedProperties props, String tableBasePath) {
Contributor

Let's move this util method to the KafkaConnectUtils class.

@cdmikechen
Contributor Author

@yihua
Hi, I've moved the buildSyncConfig method to KafkaConnectUtils and opened a new issue: https://issues.apache.org/jira/browse/HUDI-3199

@yihua yihua left a comment

LGTM

@yihua yihua merged commit e9a7f49 into apache:master Jan 9, 2022
@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022