[HUDI-3042] Refactoring clustering executors #4847
Conversation
```java
protected Map<String, List<String>> getPartitionToReplacedFileIds(HoodieWriteMetadata<JavaRDD<WriteStatus>> writeMetadata) {
  Set<HoodieFileGroupId> newFilesWritten = writeMetadata.getWriteStats().get().stream()
      .map(s -> new HoodieFileGroupId(s.getPartitionPath(), s.getFileId())).collect(Collectors.toSet());
  // for the below execution strategy, new file group id would be same as old file group id
  if (SparkSingleFileSortExecutionStrategy.class.getName().equals(config.getClusteringExecutionStrategyClass())) {
    return ClusteringUtils.getFileGroupsFromClusteringPlan(clusteringPlan)
        .collect(Collectors.groupingBy(fg -> fg.getPartitionPath(), Collectors.mapping(fg -> fg.getFileId(), Collectors.toList())));
  }
  return ClusteringUtils.getFileGroupsFromClusteringPlan(clusteringPlan)
      .filter(fg -> !newFilesWritten.contains(fg))
      .collect(Collectors.groupingBy(fg -> fg.getPartitionPath(), Collectors.mapping(fg -> fg.getFileId(), Collectors.toList())));
}
```
extracted to BaseCommitActionExecutor.java
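For context, a minimal sketch of what the engine-agnostic form could look like after extraction, with JavaRDD swapped for the HoodieData abstraction that appears later in this diff. The exact signature, the explicit clusteringPlan parameter, and the omission of the Spark single-file-sort special case are assumptions, not the verbatim result:

```java
// Sketch only, not the verbatim extraction. Assumes the shared method is
// generified over HoodieData and receives the clustering plan explicitly.
protected Map<String, List<String>> getPartitionToReplacedFileIds(
    HoodieClusteringPlan clusteringPlan,
    HoodieWriteMetadata<HoodieData<WriteStatus>> writeMetadata) {
  // File groups newly written by this clustering action; every file group in
  // the plan that is not among them is reported as replaced.
  Set<HoodieFileGroupId> newFilesWritten = writeMetadata.getWriteStats().get().stream()
      .map(s -> new HoodieFileGroupId(s.getPartitionPath(), s.getFileId()))
      .collect(Collectors.toSet());
  return ClusteringUtils.getFileGroupsFromClusteringPlan(clusteringPlan)
      .filter(fg -> !newFilesWritten.contains(fg))
      .collect(Collectors.groupingBy(
          HoodieFileGroupId::getPartitionPath,
          Collectors.mapping(HoodieFileGroupId::getFileId, Collectors.toList())));
}
```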
```java
private void validateWriteResult(HoodieWriteMetadata<JavaRDD<WriteStatus>> writeMetadata) {
  if (writeMetadata.getWriteStatuses().isEmpty()) {
    throw new HoodieClusteringException("Clustering plan produced 0 WriteStatus for " + instantTime
        + " #groups: " + clusteringPlan.getInputGroups().size() + " expected at least "
        + clusteringPlan.getInputGroups().stream().mapToInt(HoodieClusteringGroup::getNumOutputFileGroups).sum()
        + " write statuses");
  }
}
```
extracted to BaseCommitActionExecutor.java
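Likewise, a sketch of the engine-agnostic form (assumed: only the JavaRDD type parameter changes to HoodieData, and the plan is passed in explicitly):

```java
// Sketch only: fail fast if clustering produced no write statuses at all,
// since the plan's input groups should yield at least the planned output groups.
private void validateWriteResult(HoodieClusteringPlan clusteringPlan,
                                 HoodieWriteMetadata<HoodieData<WriteStatus>> writeMetadata) {
  if (writeMetadata.getWriteStatuses().isEmpty()) {
    throw new HoodieClusteringException("Clustering plan produced 0 WriteStatus for " + instantTime
        + " #groups: " + clusteringPlan.getInputGroups().size() + " expected at least "
        + clusteringPlan.getInputGroups().stream()
            .mapToInt(HoodieClusteringGroup::getNumOutputFileGroups).sum()
        + " write statuses");
  }
}
```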
```java
HoodieInstant instant = HoodieTimeline.getReplaceCommitRequestedInstant(instantTime);
// Mark instant as clustering inflight
table.getActiveTimeline().transitionReplaceRequestedToInflight(instant, Option.empty());
table.getMetaClient().reloadActiveTimeline();

final Schema schema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(config.getSchema()));
HoodieWriteMetadata<JavaRDD<WriteStatus>> writeMetadata = ((ClusteringExecutionStrategy<T, JavaRDD<HoodieRecord<? extends HoodieRecordPayload>>, JavaRDD<HoodieKey>, JavaRDD<WriteStatus>>)
    ReflectionUtils.loadClass(config.getClusteringExecutionStrategyClass(),
        new Class<?>[] {HoodieTable.class, HoodieEngineContext.class, HoodieWriteConfig.class}, table, context, config))
    .performClustering(clusteringPlan, schema, instantTime);
JavaRDD<WriteStatus> writeStatusRDD = writeMetadata.getWriteStatuses();
JavaRDD<WriteStatus> statuses = updateIndex(writeStatusRDD, writeMetadata);
writeMetadata.setWriteStats(statuses.map(WriteStatus::getStat).collect());
writeMetadata.setPartitionToReplaceFileIds(getPartitionToReplacedFileIds(writeMetadata));
commitOnAutoCommit(writeMetadata);
if (!writeMetadata.getCommitMetadata().isPresent()) {
  HoodieCommitMetadata commitMetadata = CommitUtils.buildMetadata(writeMetadata.getWriteStats().get(), writeMetadata.getPartitionToReplaceFileIds(),
      extraMetadata, operationType, getSchemaToStoreInCommit(), getCommitActionType());
  writeMetadata.setCommitMetadata(Option.of(commitMetadata));
}
return writeMetadata;
```
extracted to executeClustering() in BaseCommitActionExecutor.java
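Putting the extracted pieces together, the shared flow plausibly reads as below. Only the method signature is confirmed by this diff (see the snippet that follows); the body is a sketch that assumes HoodieData-based counterparts of the helpers above and map/collectAsList-style operations on HoodieData:

```java
// Sketch of the shared clustering flow in BaseCommitActionExecutor:
// transition the instant, run the configured strategy via reflection, then
// attach write stats, replaced file ids, and commit metadata.
protected HoodieWriteMetadata<HoodieData<WriteStatus>> executeClustering(HoodieClusteringPlan clusteringPlan) {
  HoodieInstant instant = HoodieTimeline.getReplaceCommitRequestedInstant(instantTime);
  // Mark instant as clustering inflight
  table.getActiveTimeline().transitionReplaceRequestedToInflight(instant, Option.empty());
  table.getMetaClient().reloadActiveTimeline();

  final Schema schema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(config.getSchema()));
  HoodieWriteMetadata<HoodieData<WriteStatus>> writeMetadata =
      ((ClusteringExecutionStrategy<T, HoodieData<HoodieRecord<T>>, HoodieData<HoodieKey>, HoodieData<WriteStatus>>)
          ReflectionUtils.loadClass(config.getClusteringExecutionStrategyClass(),
              new Class<?>[] {HoodieTable.class, HoodieEngineContext.class, HoodieWriteConfig.class},
              table, context, config))
          .performClustering(clusteringPlan, schema, instantTime);
  HoodieData<WriteStatus> statuses = updateIndex(writeMetadata.getWriteStatuses(), writeMetadata);
  writeMetadata.setWriteStats(statuses.map(WriteStatus::getStat).collectAsList());
  writeMetadata.setPartitionToReplaceFileIds(getPartitionToReplacedFileIds(clusteringPlan, writeMetadata));
  validateWriteResult(clusteringPlan, writeMetadata);
  commitOnAutoCommit(writeMetadata);
  if (!writeMetadata.getCommitMetadata().isPresent()) {
    HoodieCommitMetadata commitMetadata = CommitUtils.buildMetadata(
        writeMetadata.getWriteStats().get(), writeMetadata.getPartitionToReplaceFileIds(),
        extraMetadata, operationType, getSchemaToStoreInCommit(), getCommitActionType());
    writeMetadata.setCommitMetadata(Option.of(commitMetadata));
  }
  return writeMetadata;
}
```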
```java
// (diff context from the tail of the preceding abstract method)
    Iterator<HoodieRecord<T>> recordItr) throws IOException;

protected HoodieWriteMetadata<HoodieData<WriteStatus>> executeClustering(HoodieClusteringPlan clusteringPlan) {
  HoodieInstant instant = HoodieTimeline.getReplaceCommitRequestedInstant(instantTime);
```
The BaseCommitActionExecutor responsibilities are a bit confusing: it handles the regular write paths such as insert and upsert, and with this patch clustering as well; then what about compaction?
Should we make a new base class for table services, then?
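To make that suggestion concrete, a hypothetical split could look like the skeleton below; BaseTableServiceActionExecutor is an invented name purely for illustration, not part of this PR:

```java
// Hypothetical hierarchy, not part of this PR. Write-path commits and table
// services would share only the generic plumbing in a common ancestor.
abstract class BaseActionExecutor<T, I, K, O, R> {
  // timeline access, config, engine context
}

abstract class BaseCommitActionExecutor<T, I, K, O, R> extends BaseActionExecutor<T, I, K, O, R> {
  // regular write paths: insert, upsert, bulk insert, delete
}

abstract class BaseTableServiceActionExecutor<T, I, K, O, R> extends BaseActionExecutor<T, I, K, O, R> {
  // table services: clustering, compaction (plan + execute + instant transitions)
}
```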
I think the goal is to revamp the commit executors and write pipeline altogether later on, so the refactoring here is limited to code reuse. @xushiyan is that the case?
@danny0405 agreed, it does look like there are some mixed responsibilities there. I'll make a clearer separation in https://issues.apache.org/jira/browse/HUDI-2439
yihua left a comment:
LGTM
```diff
  */
 public abstract class JavaExecutionStrategy<T extends HoodieRecordPayload<T>>
-    extends ClusteringExecutionStrategy<T, List<HoodieRecord<T>>, List<HoodieKey>, List<WriteStatus>> {
+    extends ClusteringExecutionStrategy<T, HoodieData<HoodieRecord<T>>, HoodieData<HoodieKey>, HoodieData<WriteStatus>> {
```
Is this Java-specific class going to be removed as a follow-up?
@yihua yes, I'll make another PR to deal with ClusteringExecutionStrategy and its subclasses, which can be a good separation.
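For readers following the type change: the strategy contract now speaks HoodieData, so a List-backed engine only needs to bridge at the boundary. A minimal sketch, assuming a List-backed HoodieData implementation with an eager(...) factory (the class and method names here are assumptions, not this PR's API):

```java
// Sketch only: wrap the Java engine's plain-List results into the
// HoodieData-based contract, and unwrap on the way back.
private HoodieData<WriteStatus> toHoodieData(List<WriteStatus> statuses) {
  return HoodieListData.eager(statuses); // assumed List-backed HoodieData factory
}

private List<WriteStatus> toList(HoodieData<WriteStatus> statuses) {
  return statuses.collectAsList(); // materialize back to a plain List
}
```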
Extract common code from the engine-specific clustering executors to BaseCommitActionExecutor.java.
Remove redundant engine-specific classes; use ClusteringPlanActionExecutor.java instead.
Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR

For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.