@lw309637554
Contributor

What is the purpose of the pull request

Abstract a hudi-sync-common module and migrate hudi-hive-sync to hudi-sync-hive, which implements hudi-sync-common.

A follow-up will add hudi-sync-aliyun-dla, which will also implement hudi-sync-common.

This is the RFC https://cwiki.apache.org/confluence/display/HUDI/RFC+-+17+Abstract+common+meta+sync+module+support+multiple+meta+service.

The previous PR is #1716, which only abstracted hudi-sync-common.

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Member

@garyli1019 left a comment

@lw309637554 Thanks for your contribution! Overall looks good.
Open to discussing where to put these modules. I vote to put the base class in hudi-common and have separate modules for different query engines.

Member

nit: alphabetical order

Contributor Author

okay

Contributor Author

done

@lw309637554 force-pushed the HUDI-875-lw branch 2 times, most recently from 6bf22a2 to 7013097 (July 9, 2020 07:36)

@lw309637554
Contributor Author

@lw309637554 Thanks for your contribution! Overall looks good.
Open to discussing where to put these modules. I vote to put the base class in hudi-common and have separate modules for different query engines.

Thanks so much for your very valuable suggestion.

@lw309637554
Contributor Author

@vinothchandar I ran into some problems, so I opened a new PR.
In the new PR I fixed the build break. Here are my thoughts on backward compatibility:
a. User code: the Hive sync class names are unchanged, so users do not need to modify their existing code.
b. POM dependencies: the module name hudi-hive-sync is unchanged, so users do not need to modify their POM dependencies.
c. Local jar files: if a user's local jars do not shade hudi-hive-sync's indirect dependencies, they only need to add hudi-sync-common.jar to their directory, just as hudi-utilities-bundle is modified in this PR.
Reply: we can handle this by putting the hudi-sync-common base class in hudi-common.
d. Sync parameters: only new parameters are added, so compatibility is preserved.

What do you think?

@lw309637554 force-pushed the HUDI-875-lw branch 2 times, most recently from 65edf78 to 843f16a (July 15, 2020 03:25)

// for backward compatibility
if (hiveSyncEnabled) {
metaSyncEnabled = true
syncClientToolClass = DEFAULT_SYNC_CLIENT_TOOL_CLASS

Contributor

If a user syncs both Hive and DLA meta, the DLA meta would not get synced?

Contributor Author

Yes, this is for backward compatibility.

Member

would users sync to both? down the line it may make sense to provide support for syncing to multiple things.

but even here, if we just append the HiveSync class when hiveSyncEnabled=true, we can support syncing to both Hive and dla?

Contributor Author

Yes, when a user sets both hiveSyncEnabled and --sync-tool-classes, syncing to both Hive and the targets in --sync-tool-classes makes sense. I will fix it.
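
The append-instead-of-overwrite idea discussed above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper `SyncToolClassResolver`, its `resolve` method, and the `com.example.DlaSyncTool` name are hypothetical, and the Hive class name is assumed to be `org.apache.hudi.hive.HiveSyncTool`.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not taken from the PR: merges the legacy Hive flag
// into the list of configured sync tool classes instead of replacing it.
final class SyncToolClassResolver {
    // Assumed fully qualified name of the Hive sync tool.
    static final String HIVE_SYNC_TOOL_CLASS = "org.apache.hudi.hive.HiveSyncTool";

    static List<String> resolve(boolean hiveSyncEnabled, String syncToolClasses) {
        // LinkedHashSet keeps the configured order and drops duplicates.
        Set<String> resolved = new LinkedHashSet<>();
        if (syncToolClasses != null) {
            for (String name : syncToolClasses.split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) {
                    resolved.add(trimmed);
                }
            }
        }
        // Backward compatibility: the old flag appends Hive rather than
        // overwriting whatever else was configured.
        if (hiveSyncEnabled) {
            resolved.add(HIVE_SYNC_TOOL_CLASS);
        }
        return new ArrayList<>(resolved);
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, "com.example.DlaSyncTool"));
    }
}
```

With this shape, setting only the legacy flag yields just the Hive tool, while setting both yields every configured target plus Hive.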

Comment on lines +121 to +125
Option<String> lastCommitTimeSynced = Option.empty();
/*if (tableExists) {
lastCommitTimeSynced = hoodieDLAClient.getLastCommitTimeSynced(tableName);
}*/
LOG.info("Last commit time synced was found to be " + lastCommitTimeSynced.orElse("null"));

Contributor

@leesf Jul 15, 2020

Since DLA meta does not support altering table properties yet, this could be simpler here.

Contributor Author

yes
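
To illustrate why an always-empty `lastCommitTimeSynced` matters, here is a rough sketch under stated assumptions: it assumes the marker is used to limit syncing to partitions written after the last synced commit, and it uses `java.util.Optional` in place of Hudi's own `Option`. The class `PartitionSyncPlanner` and its input map are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical planner: decides which partitions need syncing.
final class PartitionSyncPlanner {
    // lastWriteCommitByPartition maps a partition path to the commit time
    // (a sortable timestamp string) of the last write that touched it.
    static List<String> partitionsToSync(Optional<String> lastCommitTimeSynced,
                                         Map<String, String> lastWriteCommitByPartition) {
        List<String> toSync = new ArrayList<>();
        // TreeMap gives a deterministic, sorted iteration order.
        for (Map.Entry<String, String> entry : new TreeMap<>(lastWriteCommitByPartition).entrySet()) {
            // With no recorded marker, every partition counts as unsynced.
            boolean writtenSinceLastSync = lastCommitTimeSynced
                    .map(synced -> entry.getValue().compareTo(synced) > 0)
                    .orElse(true);
            if (writtenSinceLastSync) {
                toSync.add(entry.getKey());
            }
        }
        return toSync;
    }
}
```

Under these assumptions, a metastore that cannot store the marker as a table property ends up on the "sync everything" path on each run.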

Contributor

@leesf left a comment

LGTM overall, left some minor comments.

@lw309637554
Contributor Author

@vinothchandar @garyli1019 I think the hudi-sync abstraction in this PR is ready, and I look forward to your review. Thanks.

  1. The module layout is:
    hudi-sync
    hudi-hive-sync/
    hudi-dla-sync/

  2. On backward compatibility:
    a. User code: the Hive sync class names are unchanged, so users do not need to modify their existing code.
    b. POM dependencies: the module name hudi-hive-sync is unchanged, so users do not need to modify their POM dependencies.
    c. Local jar files: if a user's local jars do not shade hudi-hive-sync's indirect dependencies, they only need to add hudi-sync-common.jar to their directory, just as hudi-utilities-bundle is modified in this PR.
    d. Sync parameters: new parameters are added, such as --enable-sync as the new default parameter, while --enable-hive-sync remains supported for backward compatibility.

  3. Other work, such as updating the docs, will be done in separate issues.

What do you think?
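
The module split above implies one shared base type with a separate implementation per metastore, selected by class name. A minimal sketch of that shape follows; every name in it (`MetaSyncTool`, `DemoHiveSyncTool`, `DemoDlaSyncTool`, `SyncToolLoader`) is hypothetical and is not the API introduced by this PR.

```java
import java.util.Properties;

// Hypothetical base type standing in for the shared sync abstraction.
abstract class MetaSyncTool {
    protected final Properties props;

    MetaSyncTool(Properties props) {
        this.props = props;
    }

    // Each metastore-specific module supplies its own implementation.
    abstract String syncTable();
}

// Stand-in for a Hive-backed implementation (illustrative only).
final class DemoHiveSyncTool extends MetaSyncTool {
    DemoHiveSyncTool(Properties props) { super(props); }

    @Override
    String syncTable() { return "hive:" + props.getProperty("table"); }
}

// Stand-in for a DLA-backed implementation (illustrative only).
final class DemoDlaSyncTool extends MetaSyncTool {
    DemoDlaSyncTool(Properties props) { super(props); }

    @Override
    String syncTable() { return "dla:" + props.getProperty("table"); }
}

final class SyncToolLoader {
    // Instantiates a sync tool reflectively from its class name, so callers
    // depend only on the base type and not on any one metastore module.
    static MetaSyncTool load(String className, Properties props) throws ReflectiveOperationException {
        return Class.forName(className)
                .asSubclass(MetaSyncTool.class)
                .getDeclaredConstructor(Properties.class)
                .newInstance(props);
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Properties props = new Properties();
        props.setProperty("table", "trips");
        String[] names = {DemoHiveSyncTool.class.getName(), DemoDlaSyncTool.class.getName()};
        for (String name : names) {
            System.out.println(load(name, props).syncTable());
        }
    }
}
```

Loading by name is one way to let writers keep calling a sync step without compiling against every metastore client.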

Member

@garyli1019 left a comment

Hi @lw309637554, I totally understand the importance of backward compatibility. IMO, it would be great if we could remove the Hive dependency from hudi-spark and hudi-utilities. If we treat syncHive separately, we still need to include some Hive-related packages in these two modules.
I ran into this dependency issue before while testing the delta streamer. I didn't use Hive at all, but I needed to resolve some Hive dependency conflicts in my production environment. So I'd be inclined to sacrifice some backward compatibility and move all the Hive-related packages to hudi-hive-sync. Do you think this is possible?
Happy to hear what you guys think.

@lw309637554
Contributor Author

Hi @lw309637554, I totally understand the importance of backward compatibility. IMO, it would be great if we could remove the Hive dependency from hudi-spark and hudi-utilities. If we treat syncHive separately, we still need to include some Hive-related packages in these two modules.
I ran into this dependency issue before while testing the delta streamer. I didn't use Hive at all, but I needed to resolve some Hive dependency conflicts in my production environment. So I'd be inclined to sacrifice some backward compatibility and move all the Hive-related packages to hudi-hive-sync. Do you think this is possible?
Happy to hear what you guys think.

This is a good point. I think removing the Hive dependency from hudi-spark and hudi-utilities is separate work; we can open another issue to resolve it.

@garyli1019
Member

This is a good point. I think removing the Hive dependency from hudi-spark and hudi-utilities is separate work; we can open another issue to resolve it.

sounds good. follow up ticket: https://issues.apache.org/jira/browse/HUDI-1101

@lw309637554
Contributor Author

This is a good point. I think removing the Hive dependency from hudi-spark and hudi-utilities is separate work; we can open another issue to resolve it.

sounds good. follow up ticket: https://issues.apache.org/jira/browse/HUDI-1101

very nice

@lw309637554
Contributor Author

@vinothchandar The pr is ready overall. Can you help to review ?

@vinothchandar
Member

@lw309637554 can you please give me a couple days. I am trying to prioritize all the 0.6.0 blockers for now.

@lw309637554
Contributor Author

couple

Okay, thanks.

@leesf
Contributor

leesf commented Aug 1, 2020

@lw309637554 would you please rebase and fix the conflicts.

@leesf self-assigned this Aug 1, 2020

@lw309637554
Contributor Author

@lw309637554 would you please rebase and fix the conflicts.

okay, done

@vinothchandar
Member

This is on my plate for this week.

@vinothchandar
Member

IMO, that will be great if we can remove the hive dependency from hudi-spark and hudi-utilities

We can discuss this on the JIRA, but it needs more thought. We want the Spark datasource write and deltastreamer to continue to sync to Hive when the write completes. So it's a necessary thing, IMO.

Member

@vinothchandar left a comment

Hive-related changes and the overall structure look good to me. Thanks for an elegant implementation @lw309637554.

Left some comments around making the hive vs meta handling more generic. Let me know what you think. if supporting multiple sync targets is a desirable thing, then it may make sense to structure like that.

// for backward compatibility
if (hiveSyncEnabled) {
metaSyncEnabled = true
syncClientToolClass = DEFAULT_SYNC_CLIENT_TOOL_CLASS

Member

would users sync to both? down the line it may make sense to provide support for syncing to multiple things.

but even here, if we just append the HiveSync class when hiveSyncEnabled=true, we can support syncing to both Hive and dla?

if (hiveSync) {
Metrics.registerGauge(getMetricsName("deltastreamer", "hiveSyncDuration"), getDurationInMs(syncNs));
} else {
Metrics.registerGauge(getMetricsName("deltastreamer", "metaSyncDuration"), getDurationInMs(syncNs));

Member

what if both hive and meta sync are off? we would still emit metrics for meta?

Member

should we derive the metric name from the sync tool class. i.e instead of metaSyncDuration, we do dlaSyncDuration? that seems more usable and understandable

Contributor Author

I have done it; each sync tool class now has its own metric, named after the sync class.

// Send DeltaStreamer Metrics
metrics.updateDeltaStreamerMetrics(overallTimeMs, hiveSyncTimeMs);
metrics.updateDeltaStreamerMetrics(overallTimeMs, hiveSyncTimeMs, true);
metrics.updateDeltaStreamerMetrics(overallTimeMs, metaSyncTimeMs, false);

Member

is there a way to do this by iterating over the configured sync tool classes? i.e only do it when sync is configured?

Contributor Author

OK, this is now done in syncMeta.
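
A rough sketch of iterating over the configured sync tool classes, so that a duration is recorded only for tools that actually run. The class `PerToolSyncMetrics`, the gauge name format, and the use of a plain map in place of a metrics registry are all assumptions for illustration; the real `syncMeta` is not shown in this thread.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: one duration gauge per configured sync tool class.
final class PerToolSyncMetrics {
    // Assumed naming scheme, keyed by the sync tool class name.
    static String gaugeName(String syncToolClass) {
        return "deltastreamer." + syncToolClass + ".metaSyncDuration";
    }

    // Runs each configured tool and records how long it took in milliseconds.
    // An empty configuration produces no gauges at all.
    static Map<String, Long> syncAndMeasure(List<String> syncToolClasses, Consumer<String> runSync) {
        Map<String, Long> gauges = new LinkedHashMap<>();
        for (String toolClass : syncToolClasses) {
            long startNs = System.nanoTime();
            runSync.accept(toolClass);
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
            gauges.put(gaugeName(toolClass), elapsedMs);
        }
        return gauges;
    }
}
```

Driving the metrics from the same list that drives the sync avoids emitting a meta sync gauge when neither Hive nor any other target is enabled.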

@vinothchandar
Member

@lw309637554 I rebased this off master and also did some of the smaller stuff myself. If we can make a call on the multiple targets and the metrics questions, we can resolve and land.

Contributor Author

@lw309637554 left a comment

@lw309637554 I rebased this off master and also did some of the smaller stuff myself. If we can make a call on the multiple targets and the metrics questions, we can resolve and land.

Thanks @vinothchandar
a. On the metrics question: every sync class has its own metric, with the sync class name as the metric name.
b. When a user sets sync-hive-enable, the Hive sync class is simply added to the sync classes, and all the sync classes will sync, for example both Hive and DLA.

// for backward compatibility
if (hiveSyncEnabled) {
metaSyncEnabled = true
syncClientToolClass = DEFAULT_SYNC_CLIENT_TOOL_CLASS

Contributor Author

Yes, when a user sets both hiveSyncEnabled and --sync-tool-classes, syncing to both Hive and the targets in --sync-tool-classes makes sense. I will fix it.

if (hiveSync) {
Metrics.registerGauge(getMetricsName("deltastreamer", "hiveSyncDuration"), getDurationInMs(syncNs));
} else {
Metrics.registerGauge(getMetricsName("deltastreamer", "metaSyncDuration"), getDurationInMs(syncNs));

Contributor Author

I have done it; each sync tool class now has its own metric, named after the sync class.

// Send DeltaStreamer Metrics
metrics.updateDeltaStreamerMetrics(overallTimeMs, hiveSyncTimeMs);
metrics.updateDeltaStreamerMetrics(overallTimeMs, hiveSyncTimeMs, true);
metrics.updateDeltaStreamerMetrics(overallTimeMs, metaSyncTimeMs, false);

Contributor Author

OK, this is now done in syncMeta.

Member

@vinothchandar left a comment

@lw309637554 one issue with the datasource config parsing.

Contributor

please use // TODO here

Contributor Author

done

Contributor

@leesf Aug 5, 2020

please change to warn

Contributor Author

done

Contributor

ditto

Contributor Author

done

Contributor

use HiveSyncTool.class.getName here?

Contributor Author

done

Contributor

use HiveSyncTool.class.getName?

Contributor Author

done

@vinothchandar
Member

@lw309637554 is this ready for a final review?

Member

do we need the entire class name here? Would that not make for a long metric name? :)

May be have a getShortName() method for the AbstractSyncTool class and return "hive" and "dla" from them?

Contributor Author

OK, I will do it.

Contributor Author

done
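
The short-name suggestion in this thread could look roughly like the sketch below. The classes `ShortNamedSyncTool`, `HiveShortNamed`, and `DlaShortNamed` and the `durationMetricName` helper are hypothetical; only the idea of a `getShortName()` method returning "hive" and "dla" comes from the review comment.

```java
// Hypothetical base class: each sync tool exposes a short, stable name
// that is used for metrics instead of the fully qualified class name.
abstract class ShortNamedSyncTool {
    abstract String getShortName();

    // Assumed metric naming: short name plus a fixed suffix.
    String durationMetricName() {
        return getShortName() + "SyncDuration";
    }
}

final class HiveShortNamed extends ShortNamedSyncTool {
    @Override
    String getShortName() { return "hive"; }
}

final class DlaShortNamed extends ShortNamedSyncTool {
    @Override
    String getShortName() { return "dla"; }
}

final class ShortNameDemo {
    public static void main(String[] args) {
        System.out.println(new HiveShortNamed().durationMetricName());
        System.out.println(new DlaShortNamed().durationMetricName());
    }
}
```

A short name also keeps metric names stable if a sync tool class is later renamed or moved to another package.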

Member

@vinothchandar left a comment

@leesf @lw309637554 I reviewed the logic again. It looks good to me.
Left one last suggestion on using a short name for metrics, vs using the class name as is.

I will let you both decide and @leesf can land this, after a final review

@lw309637554 force-pushed the HUDI-875-lw branch 2 times, most recently from 015ac1b to 856ef5a (August 5, 2020 09:55)

@vinothchandar
Member

@lw309637554 this seems ready?

@lw309637554
Contributor Author

@lw309637554 this seems ready?

Yes, but the tests failed for some reason, so I am rerunning them.

@vinothchandar merged commit 51ea27d into apache:master Aug 6, 2020

4 participants