-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-8622] fix performance regression of tag when written into consistent bucket index table #12389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-8622] fix performance regression of tag when written into consistent bucket index table #12389
Conversation
|
@danny0405 Hi, I found a another issue which failed the test, please have a look at #12394 |
…ucket index table 1. fix performance regression of tag when written into consistent bucket index table 2. unified the tag logic of the bucket index and lazily loaded the required mapper information Signed-off-by: TheR1sing3un <[email protected]>
1. missing Serializable Signed-off-by: TheR1sing3un <[email protected]>
…metadata file creation 1. fix concurrency problem during consistent-hash-bucket's initial metadata file creation Signed-off-by: TheR1sing3un <[email protected]>
74a954c to
3e7f216
Compare
hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java
Outdated
Show resolved
Hide resolved
| Predicate<StoragePathInfo> hashingMetaCommitFilePredicate = pathInfo -> { | ||
| String filename = pathInfo.getPath().getName(); | ||
| return filename.contains(HoodieConsistentHashingMetadata.HASHING_METADATA_COMMIT_FILE_SUFFIX); | ||
| return filename.endsWith(HoodieConsistentHashingMetadata.HASHING_METADATA_COMMIT_FILE_SUFFIX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using HoodieStorage::createImmutableFileInPath will create a temp file with UUID-suffix, so we should ignore these temp files.
| return false; | ||
| boolean exist; | ||
| try { | ||
| exist = storage.exists(fullPath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still return success for creating failed but the target file already exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elaborate why the test case fails without changing the parallelism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elaborate why the test case fails without changing the parallelism?
For example, if the parallelism level is 2, two tasks load the index mapping relationship respectively. task1 finds that there is no valid metadata when loading metadata, and creates an initial metadata file. When task1 creates the file but has not finished writing the content, task2 loads the mapping and reads the empty file created by task1, causing an exception
1. fix the TestSparkConsistentBucketClustering Signed-off-by: TheR1sing3un <[email protected]>
1. fix the TestConsistentBucketIndex Signed-off-by: TheR1sing3un <[email protected]>
| private List<WriteStatus> writeData(String commitTime, int totalRecords, boolean doCommit) { | ||
| List<HoodieRecord> records = dataGen.generateInserts(commitTime, totalRecords); | ||
| JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 2); | ||
| JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change the parallelism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it because the metadata file creation conflicts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change the parallelism?
To pass ut.
is it because the metadata file creation conflicts?
Yes, Because local fs does not create temporary files, it accesses the intermediate state when concurrently reading/creating metadata files.
…stent bucket index table (apache#12389) * fix: fix performance regression of tag when written into consistent bucket index table 1. fix performance regression of tag when written into consistent bucket index table 2. unified the tag logic of the bucket index and lazily loaded the required mapper information --------- Signed-off-by: TheR1sing3un <[email protected]> Co-authored-by: danny0405 <[email protected]>
Similar to #10130, collecting partition mapper in advance will affect performance, so we use lazy loading.
And I unify the tag logic of the bucket index for better code readability and scalability.
Change Logs
Describe context and summary for this change. Highlight if any code was copied.
Impact
Describe any public API or user-facing feature change or any performance impact.
none
Risk level (write none, low medium or high below)
none
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist