Conversation

@satishkotha
Member

What is the purpose of the pull request

Add support for reusing fileIds in the clustering execution strategy. This is strategy-specific; the default is still to create new files.

Brief change log

Some datasets rely on an external index. Clustering cannot change record locations for such datasets (because the external index doesn't support updates). We can still take advantage of clustering by doing 'local' sorting within each file. This PR adds support for such strategies.

Also made small changes to how metadata is generated after clustering completes (metadata was previously generated redundantly, twice; one of the two paths was removed to keep things simple).
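For illustration, here is a minimal, self-contained sketch of the fileId-reuse idea. It is not the PR's code: the `Row` record and plain `String` file ids stand in for Hudi's `HoodieRecord` and `HoodieFileGroupId`; the point is only the concept (same fileId, records re-sorted within the file).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-ins: Row ~ HoodieRecord, String map key ~ HoodieFileGroupId.
public class IdentityClusteringSketch {

  record Row(String recordKey, String sortColumn) { }

  // "Clusters" each file group in place: the fileId is preserved, so an external
  // index mapping record key -> fileId never needs an update; only the record
  // order inside each file changes.
  static Map<String, List<Row>> clusterInPlace(Map<String, List<Row>> fileGroups) {
    Map<String, List<Row>> result = new HashMap<>();
    for (Map.Entry<String, List<Row>> group : fileGroups.entrySet()) {
      List<Row> sorted = new ArrayList<>(group.getValue());
      sorted.sort(Comparator.comparing(Row::sortColumn)); // local, per-file sort
      result.put(group.getKey(), sorted);                 // same fileId, new contents
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<Row>> input = Map.of(
        "fileId-1", List.of(new Row("k3", "c"), new Row("k1", "a"), new Row("k2", "b")));
    System.out.println(clusterInPlace(input)); // fileId-1 keeps its id; rows now sorted
  }
}
```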

Verify this pull request

This change added tests.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter

codecov-commenter commented May 7, 2021

Codecov Report

Merging #2918 (32dbe35) into master (0284cde) will increase coverage by 15.29%.
The diff coverage is n/a.

@@              Coverage Diff              @@
##             master    #2918       +/-   ##
=============================================
+ Coverage     54.23%   69.53%   +15.29%     
+ Complexity     3810      374     -3436     
=============================================
  Files           488       54      -434     
  Lines         23574     2002    -21572     
  Branches       2510      237     -2273     
=============================================
- Hits          12786     1392    -11394     
+ Misses         9636      478     -9158     
+ Partials       1152      132     -1020     
| Flag | Coverage Δ | Complexity Δ |
|---|---|---|
| hudicli | ? | ? |
| hudiclient | ? | ? |
| hudicommon | ? | ? |
| hudiflink | ? | ? |
| hudihadoopmr | ? | ? |
| hudisparkdatasource | ? | ? |
| hudisync | ? | ? |
| huditimelineservice | ? | ? |
| hudiutilities | 69.53% <ø> (-0.05%) | 374.00 <ø> (-1.00) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...apache/hudi/utilities/deltastreamer/DeltaSync.java | 71.08% <0.00%> (-0.35%) | 55.00% <0.00%> (-1.00%) |
| ...n/java/org/apache/hudi/common/HoodieCleanStat.java | | |
| ...n/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala | | |
| ...hadoop/realtime/RealtimeCompactedRecordReader.java | | |
| ...che/hudi/exception/InvalidHoodiePathException.java | | |
| .../hudi/async/SparkStreamingAsyncCompactService.java | | |
| ...org/apache/hudi/common/model/TableServiceType.java | | |
| .../common/util/queue/FunctionBasedQueueProducer.java | | |
| ...ava/org/apache/hudi/cli/commands/UtilsCommand.java | | |
| .../hudi/table/format/cow/ParquetSplitReaderUtil.java | | |

... and 425 more

```java
                                    final Map<String, String> strategyParams,
                                    final Schema schema,
                                    final List<HoodieFileGroupId> inputFileIds) {
  if (inputRecords.getNumPartitions() != 1 || inputFileIds.size() != 1) {
```
Contributor

If there must be exactly one fileId, shouldn't each clustering group have just one file group? I don't see that limit enforced in clustering scheduling.

Member Author

Yes, this is enforced by setting the group size limit to a small number. See the unit test added: `.withClusteringMaxBytesInGroup(10) // set small number so each file is considered as separate clustering group`

Contributor

Can we support another config, such as `filegroupLocalSort`? Reusing `withClusteringMaxBytesInGroup` and setting it that small may confuse users.

@lw309637554
Contributor

@satishkotha hello, I have some doubts:

  1. I only see a test strategy added. Will a formal strategy be added later?
  2. Which index is this PR meant to support?
  3. If every file group just transforms into a file group with the same name, can small files no longer be merged?

@satishkotha
Member Author


@lw309637554

  1. Yes, the actual strategy can easily be added once we agree on the high-level change.
  2. This is to support HBaseIndex, which does not support updating a record's location.
  3. Yes, you are right: the merging strategy cannot be applied to tables that use HBaseIndex. We can still do local 'file-level' sorting, i.e., sort the records in each data file by a specified column so that only one block (row group) needs to be read for queries.

Let me know if you have any other questions or comments.
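For intuition on why the per-file sort pays off, here is a toy sketch of min/max row-group pruning (the stats model is hypothetical and uses no Hudi or Parquet APIs): when a file is sorted by the lookup column, its row groups cover disjoint key ranges, so a point lookup overlaps at most one group.

```java
import java.util.List;

// Toy model of Parquet-style row-group pruning over min/max column statistics.
public class RowGroupPruningSketch {

  record RowGroupStats(String min, String max) { }

  // Counts the row groups whose [min, max] range contains the lookup key.
  static long groupsToRead(List<RowGroupStats> stats, String key) {
    return stats.stream()
        .filter(s -> s.min().compareTo(key) <= 0 && key.compareTo(s.max()) <= 0)
        .count();
  }

  public static void main(String[] args) {
    // Per-row-group min/max stats, as a file sorted by the lookup column would produce
    List<RowGroupStats> sortedFile = List.of(
        new RowGroupStats("a", "f"), new RowGroupStats("g", "m"), new RowGroupStats("n", "z"));
    System.out.println(groupsToRead(sortedFile, "h")); // 1 -> only one row group is scanned
  }
}
```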

@nsivabalan nsivabalan added the priority:high Significant impact; potential bugs label May 11, 2021
```java
public void testClusteringWithOneFilePerGroup() throws Exception {
  HoodieClusteringConfig clusteringConfig = HoodieClusteringConfig.newBuilder().withClusteringMaxNumGroups(10)
      .withClusteringMaxBytesInGroup(10) // set small number so each file is considered as separate clustering group
      .withClusteringExecutionStrategyClass("org.apache.hudi.ClusteringIdentityTestExecutionStrategy")
```
Contributor

Can we add a config for this?

Member Author

This is just a unit test. I will provide another clustering scheduling strategy, as part of another PR, to limit the number of files per group.

```java
import java.util.Map;

/**
 * A HoodieCreateHandle which writes all data into a single file.
```
Contributor

HoodieCreateFixedHandle

Member Author

Fixed

Member

This is a bit of a misnomer; even HoodieCreateHandle only writes to a single file.

Rename to HoodieUnboundedCreateHandle, or something that captures the intent that this handle does not respect the sizing aspects.

@lw309637554
Contributor

lw309637554 commented May 13, 2021

@satishkotha The high-level change is OK. I just have two other comments:

  1. For `.withClusteringMaxBytesInGroup(10) // set small number so each file is considered as separate clustering group`, can we add another config?
  2. Will sorting be supported in HoodieCreateFixedHandle?

@satishkotha
Member Author

satishkotha commented May 13, 2021

> Can we add another config?

Yes, that will actually be provided as a separate strategy.

> Will sorting be supported in HoodieCreateFixedHandle?

That part will be provided in the execution strategy. Right now I only added a test strategy, which doesn't support sorting. I'm going to work on adding a real strategy that sorts.

Both of the above strategies will be sent as another PR. Let me know if that works.

@lw309637554
Contributor

@satishkotha LGTM

@vinothchandar
Member

> We cannot change record locations for clustering (because the external index doesn't support updates). We can still take advantage of clustering by doing 'local' sorting within each file.

This can be achieved by sorting at original write time, correct?

Member

@vinothchandar left a comment

I would like to understand whether more changes are needed to support what we are shooting for here, or whether you plan to keep the strategy internal and these changes are enough for you to run it in production.

I would like to avoid a lot of specific changes to support not changing the record location, and to have it as a separate strategy. Is that how you are thinking about it as well?

If so, then we can address the naming and other reuse comments and land this. Please let me know.

```java
 *
 * Please use this with caution. This can end up creating very large files if not used correctly.
 */
public class CreateFixedFileHandleFactory<T extends HoodieRecordPayload, I, K, O> extends WriteHandleFactory<T, I, K, O> {
```
Member

Can we subclass this from CreateHandleFactory? Or call it SingleFileCreateHandleFactory?

```java
}

@Override
public HoodieWriteHandle<T, I, K, O> create(final HoodieWriteConfig hoodieConfig, final String commitTime,
```
Member

Wondering why we need this, actually. Wouldn't just passing Long.MAX_VALUE as the target file size get the create handle to do this?

```java
}

@Override
public boolean canWrite(HoodieRecord record) {
```
Member

Let's just reuse CreateHandle with a large target file size, if we are doing all this for just a specific clustering strategy?
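For reference, the trade-off being discussed comes down to the `canWrite` check. A hedged sketch of the two behaviors follows; the field and constant names here are illustrative, not Hudi's:

```java
// Toy contrast of the two options above; names are illustrative only.
public class CanWriteSketch {

  static final long TARGET_FILE_SIZE = 120L * 1024 * 1024; // hypothetical 120 MB target

  long bytesWritten;      // bytes written to the current file so far
  long avgRecordSize = 1; // estimated size of the next record

  // Size-bounded handle: stops accepting records once the target size is hit,
  // so the writer rolls over to a new file (and a new fileId).
  boolean canWriteBounded() {
    return bytesWritten + avgRecordSize <= TARGET_FILE_SIZE;
  }

  // "Single file" handle: always accepts, so every record of the clustering
  // group lands in one file. Passing Long.MAX_VALUE as the target size to the
  // bounded version degenerates to the same behavior, which is the reviewer's point.
  boolean canWriteUnbounded() {
    return true;
  }
}
```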

```diff
 */
 public abstract O performClustering(final I inputRecords, final int numOutputGroups, final String instantTime,
-                                    final Map<String, String> strategyParams, final Schema schema);
+                                    final Map<String, String> strategyParams, final Schema schema, final List<HoodieFileGroupId> inputFileIds);
```
Member

Can you please add Javadoc for this method explaining what each param is?
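For example, something along these lines could work; the signature is from this PR, but the parameter descriptions are inferred from the surrounding discussion, so treat them as a draft:

```java
/**
 * Executes clustering over the given input records and returns the write statuses.
 *
 * @param inputRecords    records read from the file slices of this clustering group
 * @param numOutputGroups number of output file groups to create
 * @param instantTime     instant time of the clustering (replacecommit) operation
 * @param strategyParams  strategy-specific parameters from the clustering plan
 * @param schema          schema of the input records, including Hudi metadata fields
 * @param inputFileIds    file group ids of the input slices; strategies that must
 *                        preserve record locations can reuse these ids for the output files
 */
public abstract O performClustering(final I inputRecords, final int numOutputGroups, final String instantTime,
                                    final Map<String, String> strategyParams, final Schema schema,
                                    final List<HoodieFileGroupId> inputFileIds);
```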

```diff
 assertEquals(0, fileIdIntersection.size());

-config = getConfigBuilder(HoodieFailedWritesCleaningPolicy.LAZY).withAutoCommit(completeClustering)
+config = getConfigBuilder(HoodieFailedWritesCleaningPolicy.LAZY).withAutoCommit(false)
```
Member

So we don't honor completeClustering anymore? Not following why this change was needed.

```java
JavaSparkContext jsc = HoodieSparkEngineContext.getSparkContext(context);
JavaRDD<HoodieRecord<? extends HoodieRecordPayload>> inputRecords = readRecordsForGroup(jsc, clusteringGroup);
Schema readerSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(config.getSchema()));
List<HoodieFileGroupId> inputFileIds = clusteringGroup.getSlices().stream()
```
Member

So the input file ids are already in the serialized plan, and this PR just passes them around additionally?

```diff
 protected Map<String, List<String>> getPartitionToReplacedFileIds(JavaRDD<WriteStatus> writeStatuses) {
-  return ClusteringUtils.getFileGroupsFromClusteringPlan(clusteringPlan).collect(
-      Collectors.groupingBy(fg -> fg.getPartitionPath(), Collectors.mapping(fg -> fg.getFileId(), Collectors.toList())));
+  Set<HoodieFileGroupId> newFilesWritten = new HashSet(writeStatuses.map(s -> s.getFileId()).collect());
```
Member

Rename: newFileIds

```diff
-      Collectors.groupingBy(fg -> fg.getPartitionPath(), Collectors.mapping(fg -> fg.getFileId(), Collectors.toList())));
+  Set<HoodieFileGroupId> newFilesWritten = new HashSet(writeStatuses.map(s -> s.getFileId()).collect());
+  return ClusteringUtils.getFileGroupsFromClusteringPlan(clusteringPlan)
+      .filter(fg -> !newFilesWritten.contains(fg))
```
Member

Sorry, not following. Why do we need this filter?

```java
  return hoodieRecord;
}

private HoodieWriteMetadata<JavaRDD<WriteStatus>> buildWriteMetadata(JavaRDD<WriteStatus> writeStatusJavaRDD) {
```
Member

This was removed because the constructor does the same job?

@satishkotha
Member Author

I think all changes in this have already been merged as part of #3419. Closing this.
