
@TheR1sing3un
Member

Similar to #10130, collecting the partition mapper in advance hurts performance, so this PR loads it lazily.
It also unifies the tagging logic of the bucket index for better readability and extensibility.
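
The lazy-loading idea can be sketched roughly as follows. This is only an illustration, not the code from this PR: `LazyPartitionMappers`, its `loader` function, and the string-valued "mapper" are hypothetical stand-ins for whatever per-partition information the index actually loads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: load a per-partition mapper on first use instead of
// collecting the mappers of all partitions up front.
class LazyPartitionMappers<M> {
  private final Map<String, M> cache = new ConcurrentHashMap<>();
  private final Function<String, M> loader;

  LazyPartitionMappers(Function<String, M> loader) {
    this.loader = loader;
  }

  // Runs the loader the first time a partition is requested, then reuses the result.
  M get(String partitionPath) {
    return cache.computeIfAbsent(partitionPath, loader);
  }

  public static void main(String[] args) {
    AtomicInteger loads = new AtomicInteger();
    LazyPartitionMappers<String> mappers = new LazyPartitionMappers<>(partition -> {
      loads.incrementAndGet(); // stands in for an expensive metadata read
      return "mapper-for-" + partition;
    });
    mappers.get("2024/12/01");
    mappers.get("2024/12/01"); // served from the cache, no second load
    mappers.get("2024/12/02");
    System.out.println("loads = " + loads.get());
  }
}
```

Only partitions that are actually touched by incoming records pay the loading cost, which is the point of deferring the work.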

Change Logs

  1. Fix the performance regression in tagging when writing to a consistent bucket index table.
  2. Unify the tagging logic of the bucket index and lazily load the required mapper information.

Impact

none

Risk level (write none, low medium or high below)

none


@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Dec 1, 2024
@TheR1sing3un TheR1sing3un changed the title [HUDI-8622]: fix performance regression of tag when written into consistent bucket index table [HUDI-8622][DRAFT]: fix performance regression of tag when written into consistent bucket index table Dec 2, 2024
@TheR1sing3un
Member Author

@danny0405 Hi, I found another issue that causes the test to fail. Please take a look at #12394.

TheR1sing3un and others added 4 commits December 12, 2024 15:30
…ucket index table

1. fix performance regression of tag when written into consistent bucket index table
2. unified the tag logic of the bucket index and lazily loaded the required mapper information

Signed-off-by: TheR1sing3un <[email protected]>
1. missing Serializable

Signed-off-by: TheR1sing3un <[email protected]>
…metadata file creation

1. fix concurrency problem during consistent-hash-bucket's initial metadata file creation

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un TheR1sing3un force-pushed the optimized_reduce_partition_collect branch from 74a954c to 3e7f216 Compare December 12, 2024 08:33
@TheR1sing3un TheR1sing3un changed the title [HUDI-8622][DRAFT]: fix performance regression of tag when written into consistent bucket index table [HUDI-8622] fix performance regression of tag when written into consistent bucket index table Dec 12, 2024
Predicate<StoragePathInfo> hashingMetaCommitFilePredicate = pathInfo -> {
String filename = pathInfo.getPath().getName();
- return filename.contains(HoodieConsistentHashingMetadata.HASHING_METADATA_COMMIT_FILE_SUFFIX);
+ return filename.endsWith(HoodieConsistentHashingMetadata.HASHING_METADATA_COMMIT_FILE_SUFFIX);
Member Author

@TheR1sing3un TheR1sing3un Dec 12, 2024

HoodieStorage::createImmutableFileInPath creates a temp file with a UUID suffix, so the filter should ignore these temp files.
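
To illustrate why `endsWith` is the safer predicate here, the sketch below compares the two filters on a pair of file names. The suffix value `.commit` and the temp-file name are assumptions made up for the example; the real suffix is whatever `HoodieConsistentHashingMetadata.HASHING_METADATA_COMMIT_FILE_SUFFIX` holds, and the real temp-file naming is decided by the storage implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch comparing a contains-based and an endsWith-based file-name filter.
class CommitFileFilterSketch {
  // Assumed value, for illustration only.
  static final String COMMIT_SUFFIX = ".commit";

  static List<String> byContains(List<String> fileNames) {
    return fileNames.stream()
        .filter(name -> name.contains(COMMIT_SUFFIX))
        .collect(Collectors.toList());
  }

  static List<String> byEndsWith(List<String> fileNames) {
    return fileNames.stream()
        .filter(name -> name.endsWith(COMMIT_SUFFIX))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList(
        "20241212083300000.commit",
        "20241212083300000.commit.0f8b6c1e-tmp"); // hypothetical temp-file name
    System.out.println("contains: " + byContains(names)); // also matches the temp file
    System.out.println("endsWith: " + byEndsWith(names)); // matches only the finished file
  }
}
```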

return false;
boolean exist;
try {
exist = storage.exists(fullPath);
Member Author

Still return success when creation fails but the target file already exists.
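
The behaviour described in this comment can be sketched with plain `java.nio.file` calls. This is an assumption-laden illustration, not the Hudi code: `CreateOrExistsSketch` and `createOrAlreadyExists` are hypothetical names, and the real change goes through `HoodieStorage` rather than the local file API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: treat "someone else already created the file" as success.
class CreateOrExistsSketch {
  // Returns true if the target exists after the call, whether this caller
  // created it or a concurrent writer got there first.
  static boolean createOrAlreadyExists(Path target, byte[] content) {
    try {
      // CREATE_NEW fails with an IOException if the file is already present.
      Files.write(target, content, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
      return true;
    } catch (IOException e) {
      // Creation failed; still report success if the target file is there.
      return Files.exists(target);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("hashing-meta-sketch");
    Path target = dir.resolve("00000000000000.hashing_meta"); // hypothetical file name
    byte[] content = "initial".getBytes(StandardCharsets.UTF_8);
    System.out.println(createOrAlreadyExists(target, content)); // first writer creates the file
    System.out.println(createOrAlreadyExists(target, content)); // second writer finds it present
  }
}
```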

Contributor

can you elaborate why the test case fails without changing the parallelism?

Member Author

can you elaborate why the test case fails without changing the parallelism?

For example, with a parallelism of 2, two tasks each load the index mapping. Task 1 finds no valid metadata while loading, so it creates an initial metadata file. After task 1 has created the file but before it has finished writing the content, task 2 loads the mapping and reads the empty file created by task 1, which causes an exception.
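
One common way to keep readers from seeing a half-written file is to write to a temporary sibling and then rename it into place. The sketch below shows that general technique with `java.nio.file`; it is not the fix applied in this PR, and `AtomicPublishSketch` and `publish` are hypothetical names. Note that `ATOMIC_MOVE` can throw `AtomicMoveNotSupportedException` on file systems that cannot rename atomically, and its behaviour when the target already exists is implementation specific.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

// Hypothetical sketch: publish a file so that readers see either nothing or the full content.
class AtomicPublishSketch {
  static void publish(Path target, byte[] content) throws IOException {
    // Write the full content to a uniquely named temp file in the same directory.
    Path tmp = target.resolveSibling(target.getFileName() + "." + UUID.randomUUID());
    try {
      Files.write(tmp, content);
      // Rename into place; a concurrent reader never observes a partially written target.
      Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    } finally {
      // No-op after a successful move; cleans up the temp file if something failed.
      Files.deleteIfExists(tmp);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("atomic-publish-sketch");
    Path target = dir.resolve("00000000000000.hashing_meta"); // hypothetical file name
    publish(target, "bucket-mapping".getBytes(StandardCharsets.UTF_8));
    System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
  }
}
```

With this pattern, a reader that lists the directory may still see the temp file, which is why a strict suffix check on file names matters.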

1. fix the TestSparkConsistentBucketClustering

Signed-off-by: TheR1sing3un <[email protected]>
1. fix the TestConsistentBucketIndex

Signed-off-by: TheR1sing3un <[email protected]>
private List<WriteStatus> writeData(String commitTime, int totalRecords, boolean doCommit) {
List<HoodieRecord> records = dataGen.generateInserts(commitTime, totalRecords);
- JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 2);
+ JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 1);
Contributor

why change the parallelism?

Contributor

is it because the metadata file creation conflicts?

Member Author

why change the parallelism?

To make the unit test pass.

is it because the metadata file creation conflicts?

Yes. Because the local file system does not create temporary files, a task can observe the intermediate state when metadata files are read and created concurrently.

@danny0405 danny0405 merged commit 76dbdaa into apache:master Dec 16, 2024
43 checks passed
TheR1sing3un added a commit to TheR1sing3un/hudi that referenced this pull request Feb 12, 2025
…stent bucket index table (apache#12389)

* fix: fix performance regression of tag when written into consistent bucket index table

1. fix performance regression of tag when written into consistent bucket index table
2. unified the tag logic of the bucket index and lazily loaded the required mapper information

---------

Signed-off-by: TheR1sing3un <[email protected]>
Co-authored-by: danny0405 <[email protected]>