-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync #1810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
garyli1019
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lw309637554 Thanks for your contribution! Overall looks good.
Open to discuss where to put these modules. I vote to put the base class to hudi-common and have separate modules for different query engines.
...ync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/AbstractSyncHoodieClient.java
Outdated
Show resolved
Hide resolved
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: alphabetical order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
6bf22a2 to
7013097
Compare
thanks so much for your very valuable suggestion. |
|
@vinothchandar I meet some mistakes, just opened a new PR. what about your suggestion? |
65edf78 to
843f16a
Compare
| // for backward compatibility | ||
| if (hiveSyncEnabled) { | ||
| metaSyncEnabled = true | ||
| syncClientToolClass = DEFAULT_SYNC_CLIENT_TOOL_CLASS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if user sync both hive and dla meta, the dla meta would not get synced.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this will back compatibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would users sync to both? down the line it may make sense to provide support for syncing to multiple things.
but even here, if we just append the HiveSync class when hiveSyncEnabled=true, we can support syncing to both Hive and dla?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, when user set hiveSyncEnabled and --sync-tool-classes, sync both hive and --sync-tool-classes make sense. i will fix it
| Option<String> lastCommitTimeSynced = Option.empty(); | ||
| /*if (tableExists) { | ||
| lastCommitTimeSynced = hoodieDLAClient.getLastCommitTimeSynced(tableName); | ||
| }*/ | ||
| LOG.info("Last commit time synced was found to be " + lastCommitTimeSynced.orElse("null")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since dla meta do not support alter table properties yet, it would be simpler here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
leesf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, left some minor comments.
|
@vinothchandar @garyli1019 I think in this PR the hudi-sync abstract is ready. Expect your review. Thanks
what about your suggestion? |
garyli1019
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @lw309637554 , I totally understand the importance of the backward capability. IMO, that will be great if we can remove the hive dependency from hudi-spark and hudi-utilities. If we treat syncHive separately, we still need to include some Hive related packages in these two modules.
I had this dependency issue before while I was testing the delta streamer. I didn't use Hive at all but I need to resolve some Hive dependency conflicts in my production environment. So I'd incline to sacrifice some backward capability and move all the Hive related packages to hudi-hive-sync. Do you think this is possible?
Happy to hear what you guys think.
this is a good point. I think remove hive dependency from hudi-spark and hudi-utilities is a another work . we can open another issue resolve it |
sounds good. follow up ticket: https://issues.apache.org/jira/browse/HUDI-1101 |
very nice |
|
@vinothchandar The pr is ready overall. Can you help to review ? |
|
@lw309637554 can you please give me a couple days. I am trying to prioritize all the 0.6.0 blockers for now. |
okay ,thanks |
|
@lw309637554 would you please rebase and fix the conflicts. |
okay, done |
|
This is on my plate for this week. |
We can discuss on the JIRA. but this needs more thought. We want spark datasource write and deltastreamer to continue to sync to hive, when the write completes. So, its a necessary thing IMO |
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hive related changes and overall structure looks good to me. Thanks for an elegant implementation @lw309637554 .
Left some comments around making the hive vs meta handling more generic. Let me know what you think. if supporting multiple sync targets is a desirable thing, then it may make sense to structure like that.
| // for backward compatibility | ||
| if (hiveSyncEnabled) { | ||
| metaSyncEnabled = true | ||
| syncClientToolClass = DEFAULT_SYNC_CLIENT_TOOL_CLASS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would users sync to both? down the line it may make sense to provide support for syncing to multiple things.
but even here, if we just append the HiveSync class when hiveSyncEnabled=true, we can support syncing to both Hive and dla?
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Show resolved
Hide resolved
| if (hiveSync) { | ||
| Metrics.registerGauge(getMetricsName("deltastreamer", "hiveSyncDuration"), getDurationInMs(syncNs)); | ||
| } else { | ||
| Metrics.registerGauge(getMetricsName("deltastreamer", "metaSyncDuration"), getDurationInMs(syncNs)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if both hive and meta sync are off? we would still emit metrics for meta?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we derive the metric name from the sync tool class. i.e instead of metaSyncDuration, we do dlaSyncDuration? that seems more usable and understandable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have do it , different sync tool class have its own metrics with name of sync class
| // Send DeltaStreamer Metrics | ||
| metrics.updateDeltaStreamerMetrics(overallTimeMs, hiveSyncTimeMs); | ||
| metrics.updateDeltaStreamerMetrics(overallTimeMs, hiveSyncTimeMs, true); | ||
| metrics.updateDeltaStreamerMetrics(overallTimeMs, metaSyncTimeMs, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to do this by iterating over the configured sync tool classes? i.e only do it when sync is configured?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok , have do this in syncMeta
|
@lw309637554 I rebased this off master and also did some of the smaller stuff myself. if we can make a call on the multiple targets and the metrics questions, we can can resolve and land |
lw309637554
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lw309637554 I rebased this off master and also did some of the smaller stuff myself. if we can make a call on the multiple targets and the metrics questions, we can can resolve and land
thanks @vinothchandar
a. about metrics questions: every sync class have a metric with the sync class name as metric name
b. when user set sync-hive-enable, just add sync-hive to the sync classes, all the sync classes will sync. as both sync hive and dla
| // for backward compatibility | ||
| if (hiveSyncEnabled) { | ||
| metaSyncEnabled = true | ||
| syncClientToolClass = DEFAULT_SYNC_CLIENT_TOOL_CLASS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, when user set hiveSyncEnabled and --sync-tool-classes, sync both hive and --sync-tool-classes make sense. i will fix it
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Show resolved
Hide resolved
| if (hiveSync) { | ||
| Metrics.registerGauge(getMetricsName("deltastreamer", "hiveSyncDuration"), getDurationInMs(syncNs)); | ||
| } else { | ||
| Metrics.registerGauge(getMetricsName("deltastreamer", "metaSyncDuration"), getDurationInMs(syncNs)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have do it , different sync tool class have its own metrics with name of sync class
| // Send DeltaStreamer Metrics | ||
| metrics.updateDeltaStreamerMetrics(overallTimeMs, hiveSyncTimeMs); | ||
| metrics.updateDeltaStreamerMetrics(overallTimeMs, hiveSyncTimeMs, true); | ||
| metrics.updateDeltaStreamerMetrics(overallTimeMs, metaSyncTimeMs, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok , have do this in syncMeta
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lw309637554 one issue with the datasource config parsing.
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-sync/hudi-dla-sync/src/main/java/org/apache/hudi/dla/DLASyncTool.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use // TODO here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please change to warn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use HiveSyncTool.class.getName here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use HiveSyncTool.class.getName?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
3ba4653 to
fe598ff
Compare
|
@lw309637554 is this ready for a final review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the entire class name here? Would that not make for a long metric name? :)
May be have a getShortName() method for the AbstractSyncTool class and return "hive" and "dla" from them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok ,i will do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leesf @lw309637554 I reviewed the logic again. It looks good to me.
Left one last suggestion on using a short name for metrics, vs using the class name as is.
I will let you both decide and @leesf can land this, after a final review
015ac1b to
856ef5a
Compare
856ef5a to
3808c32
Compare
|
@lw309637554 this seems ready? |
yes, but the test failed for some reason, i rerun it |
5014e14 to
4dae21d
Compare
Tips
What is the purpose of the pull request
Abstract hudi-sync-common ,and migrate the hudi-hive-sync to hudi-sync-hive. hudi-sync-hive implement the hudi-sync-common .
Then will support hudi-sync-aliyun-dla implement the hudi-sync-common .
This is the RFC https://cwiki.apache.org/confluence/display/HUDI/RFC+-+17+Abstract+common+meta+sync+module+support+multiple+meta+service.
And the old PR is #1716 ,which just abstract the hudi-sync-common.
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.