[HUDI-8622] fix performance regression of tag when written into consistent bucket index table #12389

TheR1sing3un · 2024-12-01T19:50:54Z

Similar to #10130, collecting the partition mappers in advance hurts performance, so this change loads them lazily.
It also unifies the tag logic of the bucket index for better code readability and scalability.
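The lazy-loading idea above can be illustrated with a small, self-contained sketch. It is not the code from this PR: the class name `LazyPartitionMapperSketch`, the `loader` function, and the load counter are all hypothetical, and the "mapper" is just whatever value the loader returns for a partition path. The point is only that a mapper is built the first time a record from its partition is tagged, rather than for every partition up front.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: caches one mapper per partition, built on first use.
final class LazyPartitionMapperSketch<M> {
  private final Map<String, M> cache = new ConcurrentHashMap<>();
  private final Function<String, M> loader;          // assumed to be the expensive step
  private final AtomicInteger loads = new AtomicInteger();

  LazyPartitionMapperSketch(Function<String, M> loader) {
    this.loader = loader;
  }

  // Returns the cached mapper, invoking the loader only for partitions not seen before.
  M mapperFor(String partition) {
    return cache.computeIfAbsent(partition, p -> {
      loads.incrementAndGet();
      return loader.apply(p);
    });
  }

  // Number of times the loader actually ran.
  int loadCount() {
    return loads.get();
  }
}
```

With this shape, tagging a batch that touches only a few partitions pays the loading cost only for those partitions.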

Change Logs

Fix the performance regression of tagging when writing into a consistent bucket index table.
Unify the tag logic of the bucket index and lazily load the required mapper information.

Impact

none

Risk level (write none, low medium or high below)

none

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
ticket number here and follow the instruction to make
changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

TheR1sing3un · 2024-12-02T08:40:55Z

@danny0405 Hi, I found another issue that causes the test to fail; please have a look at #12394.

…ucket index table 1. fix performance regression of tag when written into consistent bucket index table 2. unified the tag logic of the bucket index and lazily loaded the required mapper information Signed-off-by: TheR1sing3un <[email protected]>

1. missing Serializable Signed-off-by: TheR1sing3un <[email protected]>

…metadata file creation 1. fix concurrency problem during consistent-hash-bucket's initial metadata file creation Signed-off-by: TheR1sing3un <[email protected]>

hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java

TheR1sing3un · 2024-12-12T08:36:40Z

...udi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java

      Predicate<StoragePathInfo> hashingMetaCommitFilePredicate = pathInfo -> {
        String filename = pathInfo.getPath().getName();
-        return filename.contains(HoodieConsistentHashingMetadata.HASHING_METADATA_COMMIT_FILE_SUFFIX);
+        return filename.endsWith(HoodieConsistentHashingMetadata.HASHING_METADATA_COMMIT_FILE_SUFFIX);


HoodieStorage::createImmutableFileInPath creates a temp file with a UUID suffix, so we should ignore these temp files. A `contains` check also matches them, whereas `endsWith` matches only the final commit file.
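To make the difference between the two predicates concrete, here is a minimal sketch. It rests on two assumptions that are not confirmed by the diff above: that the commit suffix is `.commit` (the constant `COMMIT_SUFFIX` stands in for `HoodieConsistentHashingMetadata.HASHING_METADATA_COMMIT_FILE_SUFFIX`), and that the temp file is named by appending `.` plus a UUID to the target file name. The file names are made up for illustration.

```java
import java.util.List;
import java.util.UUID;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class CommitFileFilterSketch {
  // Assumed value, standing in for the real suffix constant.
  static final String COMMIT_SUFFIX = ".commit";

  // Old behavior: matches any name that contains the suffix anywhere.
  static final Predicate<String> CONTAINS = name -> name.contains(COMMIT_SUFFIX);
  // New behavior: matches only names that end with the suffix.
  static final Predicate<String> ENDS_WITH = name -> name.endsWith(COMMIT_SUFFIX);

  static List<String> filter(List<String> names, Predicate<String> predicate) {
    return names.stream().filter(predicate).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    String committed = "20241212083000000.commit";           // hypothetical file name
    String temp = committed + "." + UUID.randomUUID();        // assumed temp-file naming
    List<String> listing = List.of(committed, temp);
    System.out.println("contains: " + filter(listing, CONTAINS));
    System.out.println("endsWith: " + filter(listing, ENDS_WITH));
  }
}
```

Under these assumptions the `contains` filter returns both names, while the `endsWith` filter returns only the committed file.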

TheR1sing3un · 2024-12-12T08:38:16Z

...udi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java

-      return false;
+      boolean exist;
+      try {
+        exist = storage.exists(fullPath);


Still return success if the creation failed but the target file already exists.
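The "failed, but the file is there anyway" rule can be sketched with plain `java.nio.file`, independent of Hudi's storage abstraction. The class and method names below are hypothetical; the sketch only shows the control flow: attempt the creation, and if it throws, fall back to an existence check instead of returning `false` unconditionally.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class CreateOrExistsSketch {
  // Returns true if the file was created by this call or already exists
  // (for example, because a concurrent writer created it first).
  static boolean createMarker(Path target) {
    try {
      Files.createFile(target);   // throws FileAlreadyExistsException if present
      return true;
    } catch (IOException e) {
      // Creation failed; treat it as success only if the target is there.
      return Files.exists(target);
    }
  }
}
```

This makes the operation idempotent from the caller's point of view: two tasks racing to create the same file both see success, while a genuine failure (such as a missing parent directory) still returns `false`.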

Can you elaborate on why the test case fails without changing the parallelism?

> Can you elaborate on why the test case fails without changing the parallelism?

For example, with a parallelism of 2, two tasks each load the index mapping. Task 1 finds no valid metadata while loading and creates an initial metadata file. If task 2 loads the mapping after task 1 has created the file but before it has finished writing the content, task 2 reads the empty file and throws an exception.
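A common way to keep readers from seeing a half-written file is to write the content to a temporary sibling and then rename it into place. The sketch below uses `java.nio.file` rather than Hudi's `HoodieStorage` API, and the class name, method name, and temp-file naming are assumptions chosen for illustration. On file systems that support it, `ATOMIC_MOVE` makes the target appear with its full content or not at all.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

final class AtomicPublishSketch {
  // Writes content to a temp sibling, then renames it to the target path.
  // Note: if the target already exists, whether ATOMIC_MOVE replaces it is
  // implementation specific, so callers should check existence first.
  static void publish(Path target, byte[] content) {
    Path tmp = target.resolveSibling(target.getFileName() + "." + UUID.randomUUID());
    try {
      Files.write(tmp, content);                               // readers never look at tmp
      Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE); // publish in one step
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      try {
        Files.deleteIfExists(tmp);                             // clean up after a failed move
      } catch (IOException ignored) {
        // best-effort cleanup only
      }
    }
  }
}
```

A reader that lists the directory must still skip the temp names, which is why the file-name filter needs to match on the exact suffix.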

1. fix the TestSparkConsistentBucketClustering Signed-off-by: TheR1sing3un <[email protected]>

1. fix the TestConsistentBucketIndex Signed-off-by: TheR1sing3un <[email protected]>

danny0405 · 2024-12-14T08:32:40Z

...ce/hudi-spark/src/test/java/org/apache/hudi/client/functional/TestConsistentBucketIndex.java

  private List<WriteStatus> writeData(String commitTime, int totalRecords, boolean doCommit) {
    List<HoodieRecord> records = dataGen.generateInserts(commitTime, totalRecords);
-    JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 2);
+    JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 1);


Why change the parallelism?

Is it because the metadata file creation conflicts?

> Why change the parallelism?

To pass the unit test.

> Is it because the metadata file creation conflicts?

Yes. Because the local file system does not create temporary files, a task can observe the intermediate state when metadata files are read and created concurrently.

hudi-bot · 2024-12-16T07:03:36Z

CI report:

b2af31f Azure: SUCCESS

…stent bucket index table (apache#12389) * fix: fix performance regression of tag when written into consistent bucket index table 1. fix performance regression of tag when written into consistent bucket index table 2. unified the tag logic of the bucket index and lazily loaded the required mapper information --------- Signed-off-by: TheR1sing3un <[email protected]> Co-authored-by: danny0405 <[email protected]>

github-actions bot added the size:M PR with lines of changes in (100, 300] label Dec 1, 2024

TheR1sing3un changed the title ~~[HUDI-8622]: fix performance regression of tag when written into consistent bucket index table~~ [HUDI-8622][DRAFT]: fix performance regression of tag when written into consistent bucket index table Dec 2, 2024

TheR1sing3un and others added 4 commits December 12, 2024 15:30

Cosmetic changes

624548e

fix: missing Serializable

1a1e732

fix: fix concurrency problem during consistent-hash-bucket's initial …

3e7f216

TheR1sing3un force-pushed the optimized_reduce_partition_collect branch from 74a954c to 3e7f216 Compare December 12, 2024 08:33

TheR1sing3un commented Dec 12, 2024

hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java (outdated)

TheR1sing3un changed the title ~~[HUDI-8622][DRAFT]: fix performance regression of tag when written into consistent bucket index table~~ [HUDI-8622] fix performance regression of tag when written into consistent bucket index table Dec 12, 2024

TheR1sing3un commented Dec 12, 2024

TheR1sing3un added 2 commits December 13, 2024 10:54

fix: fix the TestSparkConsistentBucketClustering

dec2c4d

fix: fix the TestConsistentBucketIndex

3b3ff57

TheR1sing3un requested a review from danny0405 December 13, 2024 03:01

danny0405 reviewed Dec 14, 2024

danny0405 added 2 commits December 16, 2024 11:26

Force temp file creation when creating hashing metadata

bfd61f5

Check the file existence before creation

b2af31f

danny0405 approved these changes Dec 16, 2024

danny0405 merged commit 76dbdaa into apache:master Dec 16, 2024
43 checks passed
