-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs #4175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f74bd30 to
f6a00f3
Compare
8e0bc42 to
9c13f28
Compare
|
@rmahindra123 : I can help review this patch. Can you rebase w/ latest master. also, lets sync up sometime to get some context on the patch. |
da0912c to
9202dc0
Compare
| if (this.config.isHiveLocal()) { | ||
| writer.getDeltaStreamerWrapper().getDeltaSyncService().getDeltaSync() | ||
| .syncHive(getLocalHiveServer().getHiveConf()); | ||
| hiveSyncTool = new HiveSyncTool(writer.getWriteConfig().getProps(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have we removed or deprecated DeltaSync.syncHive(...) in this patch? if not, would prefer to go via deltaSync syncHive api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just a helper function right? The flow seems a bit weird to call the method from DeltaSync. I had removed the method in DeltaSync as well since this was the only place it was used. Can discuss more 1:1 if required.
...a-connect/src/main/java/org/apache/hudi/connect/writers/KafkaConnectTransactionServices.java
Outdated
Show resolved
Hide resolved
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
Show resolved
Hide resolved
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
Outdated
Show resolved
Hide resolved
.../src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableDropPartitionCommand.scala
Show resolved
Hide resolved
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java
Show resolved
Hide resolved
| hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_ASSUME_DATE_PARTITION.key(), "true"); | ||
| hiveSyncProps.setProperty(HiveSyncConfig.HIVE_USE_PRE_APACHE_INPUT_FORMAT.key(), "false"); | ||
| hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_PARTITION_FIELDS.key(), "datestr"); | ||
| hiveSyncProps.setProperty(HiveSyncConfig.HIVE_BATCH_SYNC_PARTITION_NUM.key(), "3"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see you are setting jdbcUrl and you are additionally setting this BATCH_SYNC_PARTITION_NUM in this patch. is that intentional ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Thanks for the thorough review.
- I haven't removed it, its set in line 123.
hiveSyncProps.setProperty(HiveSyncConfig.HIVE_URL.key(), hiveTestService.getJdbcHive2Url());
2 For batch sync partition num, literally every test was setting it manually to HiveTestUtil.hiveSyncConfig.batchSyncNum = 3;
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncConfig.java
Show resolved
Hide resolved
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHiveIncrementalPuller.java
Outdated
Show resolved
Hide resolved
|
@rmahindra123 : just to be cautious. can you try test this patch out for an existing hudi table for the hive sync functionality.
|
|
@xiarixiaoyao : can you help us test this patch out. Rajesh refactored hive sync quite a bit mostly around class structuring, config usages, instantiation etc. But if you can test this out and let us know if it looks good, would be nice. |
9202dc0 to
aa7efec
Compare
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some comments. for some of the resolved comments, I will sync up directly with you
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncConfig.java
Show resolved
Hide resolved
|
addressed reviewer comments |
|
@xiarixiaoyao : Did you get a chance to test this patch out. |
|
@nsivabalan @rmahindra123 will review and test this pr This weekend |
hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestUtil.java
Outdated
Show resolved
Hide resolved
| hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_PARTITION_FIELDS.key(), "datestr"); | ||
| hiveSyncProps.setProperty(HiveSyncConfig.HIVE_BATCH_SYNC_PARTITION_NUM.key(), "3"); | ||
|
|
||
| hiveSyncConfig = new HiveSyncConfig(hiveSyncProps); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd prefer to have builder pattern to reliably construct HiveSyncConfig instead of passing uncontrollable "raw" props.
| protected TypedProperties props; | ||
|
|
||
| public AbstractSyncTool(Properties props, FileSystem fileSystem) { | ||
| public AbstractSyncTool(TypedProperties props, Configuration conf, FileSystem fs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a cleaner interface would be
| public AbstractSyncTool(TypedProperties props, Configuration conf, FileSystem fs) { | |
| public AbstractSyncTool(HoodieSyncConfig syncConfig, FileSystem fs) { |
hadoopConf should be set by fs.getConf(). And props is not controllable; hard for user to decide what to pass in here.
HoodieSyncConfig is the matching config class for SyncTool as the names suggest. Internally we can still do props = syncConfig.getProps();
...ync/hudi-sync-common/src/test/java/org/apache/hudi/sync/common/util/TestSyncUtilHelpers.java
Outdated
Show resolved
Hide resolved
...ync/hudi-sync-common/src/test/java/org/apache/hudi/sync/common/util/TestSyncUtilHelpers.java
Outdated
Show resolved
Hide resolved
| TypedProperties metaProps = new TypedProperties(); | ||
| metaProps.putAll(props); | ||
| metaProps.put(HoodieSyncConfig.META_SYNC_BASE_PATH, cfg.targetBasePath); | ||
| metaProps.put(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT, cfg.baseFileFormat); | ||
| if (props.getBoolean(HiveSyncConfig.HIVE_SYNC_BUCKET_SYNC.key(), HiveSyncConfig.HIVE_SYNC_BUCKET_SYNC.defaultValue())) { | ||
| metaProps.put(HiveSyncConfig.HIVE_SYNC_BUCKET_SYNC_SPEC, HiveSyncConfig.getBucketSpec(props.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD.key()), | ||
| props.getInteger(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS.key()))); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with builder pattern we should be able to simplify this sort of props construction
95caf2b to
60d9472
Compare
51aabb2 to
3520d00
Compare
| this.sparkSchemaLengthThreshold = getIntOrDefault(HIVE_SYNC_SCHEMA_STRING_LENGTH_THRESHOLD); | ||
| this.createManagedTable = getBooleanOrDefault(HIVE_CREATE_MANAGED_TABLE); | ||
| this.bucketSpec = props.getString(HIVE_SYNC_BUCKET_SYNC_SPEC.key(), null); | ||
| this.bucketSpec = getStringOrDefault(HIVE_SYNC_BUCKET_SYNC_SPEC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rmahindra123 Not sure why this was particularly set to null. seems like the default "" is ok.
| } | ||
|
|
||
| // enhance the similar function in child class | ||
| public static HiveSyncConfig copy(HiveSyncConfig cfg) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not used.
...sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/replication/GlobalHiveSyncConfig.java
Show resolved
Hide resolved
| @Deprecated | ||
| public AbstractSyncTool(Properties props, FileSystem fileSystem) { | ||
| this(new TypedProperties(props), new Configuration(), fileSystem); | ||
| this(new TypedProperties(props), fileSystem.getConf(), fileSystem); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rmahindra123 if user provides an fs, shall we always make use of its conf? Actually not sure why the new API allow passing a different conf rather than use the one from fileSystem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
d20111e to
de57bff
Compare
de57bff to
7481149
Compare
f102d78 to
309116a
Compare
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| @Deprecated | ||
| public AbstractSyncTool(Properties props, FileSystem fileSystem) { | ||
| this(new TypedProperties(props), new Configuration(), fileSystem); | ||
| this(new TypedProperties(props), fileSystem.getConf(), fileSystem); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
…n and standardize configs (apache#4175)" This reverts commit 5f570ea.
…n and standardize configs (apache#4175)" This reverts commit 5f570ea.
…n and standardize configs (apache#4175)" This reverts commit 5f570ea.
…andardize configs (apache#4175) - Refactor hive sync tool / config to use reflection and standardize configs Co-authored-by: sivabalan <[email protected]> Co-authored-by: Rajesh Mahindra <[email protected]> Co-authored-by: Raymond Xu <[email protected]>
…andardize configs (apache#4175) - Refactor hive sync tool / config to use reflection and standardize configs Co-authored-by: sivabalan <[email protected]> Co-authored-by: Rajesh Mahindra <[email protected]> Co-authored-by: Raymond Xu <[email protected]>
Refactor hive sync tool / config to use reflection and standardize configs