-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-6420] Fixing Hfile on-demand and prefix based reads to use optimized apis #9037
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
codope
reviewed
Jun 22, 2023
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
Outdated
Show resolved
Hide resolved
5a4e0b4 to
8582d7a
Compare
8582d7a to
81aea3c
Compare
danny0405
reviewed
Jun 22, 2023
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java
Outdated
Show resolved
Hide resolved
danny0405
reviewed
Jun 22, 2023
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java
Show resolved
Hide resolved
danny0405
reviewed
Jun 22, 2023
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java
Show resolved
Hide resolved
danny0405
reviewed
Jun 26, 2023
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java
Show resolved
Hide resolved
Collaborator
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
Show resolved
Hide resolved
danny0405
approved these changes
Jun 27, 2023
Contributor
danny0405
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Change Logs
Hfile format is known to be performant for on-demand lookup and prefix based lookup. But due to refactoring, we made minor tweaks to hfile scanner apis. Apparently, it could cause a latency hit w/ on-demand lookup when compared to full scan. The issue is, we switched reseekTo(Key) to seekTo(Key). Difference between seekTo() and reseekTo() is, both of them moves the cursor to the key of interest, but at the end of the call, seekTo() will rewind the cursor to the beginning of data block in Hfile, while reseekTo will leave the cursor at the same position.
Ref from HfileScanner docs:
We sort all the keys before any on-demand look up to ensure we would not need to go back in position once the search for a given key. Bcoz, next key is going to be lexicographically greater than the current key being searched.
This patch fixes it back to reseekTo(Key).
Also, found another bug where in we missed to sort keys to look up in base files in HoodieBackedTableMetadata in recent patch for RecordLevelIndex. Fixed the sorting towards this. Also, removed sorting from HfileData blocks, since now we have repeated sorting. So, the fix is to sort once per file Slice within lookupKeysFromFileSlice, so that individual look ups (base file, each log files) does not need to sort.
Fixed argument names and variable names in downstream consumers to avoid ambiguity on whether they keys are sorted or not.
Impact
This should improve the on-demand lookup with Hfile by a large factor. From our micro-benchmarking, we found latency improvement from 10s of seconds to 100 to 200ms(when 1000s are keys are looked up in a Hfile containing 100k entries).
Risk level (write none, low medium or high below)
medium
Documentation Update
N/A
Contributor's checklist