[HUDI-2704] Adding RFC-37 for Metadata based bloom index #3932
| original file group location. And this index will leverage both the partitions to deduce the record key => file name mappings. |
| ``` |
| Input: JavaRdd<HoodieKey> |
Can the input/output be made generic so that it works with more engines?
| <img src="metadata_index_col_stats.png" alt="Column Stats Partition" width="600"/> |
| We have to encode column names, filenames etc to IDs to save storage and to exploit compression. We will update the RFC |
Encoding column names and partition names may not be required, since HFile compresses blocks of key-value data. Repeated strings such as column names will therefore compress well.
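A rough way to sanity-check this point is to compress a run of keys that repeat the same column names. The sketch below is only an illustration: it uses the JDK's `Deflater` as a stand-in for whatever block codec the HFile is configured with, and the key layout (`<columnName>/<fileName>`), the column names, and the file names are all made up for the example rather than taken from the RFC.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class RepeatedKeyCompressionSketch {

    /** Builds newline-separated keys of the hypothetical form columnName/fileName, grouped by column. */
    static byte[] buildKeys(String[] columnNames, int filesPerColumn) {
        StringBuilder sb = new StringBuilder();
        for (String column : columnNames) {
            for (int i = 0; i < filesPerColumn; i++) {
                sb.append(column).append('/')
                  .append(String.format("file-%06d.parquet", i))
                  .append('\n');
            }
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Compresses the input with DEFLATE (stand-in for a block compression codec). */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String[] columns = {"customer_id", "order_timestamp", "shipping_address_city"};
        byte[] raw = buildKeys(columns, 1000);
        byte[] packed = deflate(raw);
        System.out.println("raw bytes: " + raw.length + ", deflated bytes: " + packed.length);
    }
}
```

The exact ratio depends on the codec, block size, and real key distribution, so this only suggests the direction of the effect, not the savings an actual metadata table would see.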
| Requirements:<br> |
| Given a list of FileIDs, return their bloom filters |
| ``` |
| Key format: [PartitionId][FileId] |
Since fileId is UUID-based, can we assume that fileIds are unique within Hudi? If so, the partitionId is not required here.
But prefixing with partitionId may lead to better performance, as all the fileIds for a partition will be stored together in the same block.
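The locality argument can be sketched with a sorted set standing in for the sorted key space of the index. This is an assumption-laden illustration, not the RFC's implementation: the class and method names are hypothetical, partition IDs are assumed to be fixed-width strings so that plain concatenation stays unambiguous, and a `TreeSet` is used only to mimic sorted storage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.UUID;

final class BloomKeyLayoutSketch {

    /** Key format from the RFC excerpt: [PartitionId][FileId]. Assumes fixed-width partition IDs. */
    static String compositeKey(String partitionId, String fileId) {
        return partitionId + fileId;
    }

    /** Returns every key that starts with the partition prefix, using one contiguous range scan. */
    static List<String> keysForPartition(NavigableSet<String> sortedKeys, String partitionId) {
        String upperBound = partitionId + Character.MAX_VALUE; // sorts after any key with this prefix
        return new ArrayList<>(sortedKeys.subSet(partitionId, true, upperBound, false));
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = new TreeSet<>();
        for (String partition : new String[] {"p0001", "p0002", "p0003"}) {
            for (int i = 0; i < 3; i++) {
                // Random UUIDs stand in for file IDs; on their own they would sort in no useful order.
                keys.add(compositeKey(partition, UUID.randomUUID().toString()));
            }
        }
        System.out.println("All keys in sorted order: " + keys);
        System.out.println("Range scan for p0002:     " + keysForPartition(keys, "p0002"));
    }
}
```

Without the prefix, the same lookup would need one point read per fileId, scattered across the key space; with it, the keys for a partition sit next to each other, which is the block-locality benefit mentioned above.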
| Hudi maintains indices to locate/map incoming records to file groups during writes. Most commonly |
| used record index is the HoodieBloomIndex. For larger installations and for global index types, performance might be an issue |
| due to loading of bloom from large number of data files and due to throttling issues with some of the cloud stores. We are proposing to |
| build a new Metadata index (metadata table based bloom index) to boost the performance of existing bloom index. |
Why not use a record-level-index for this functionality? Is it because of storage requirements? Wouldn't a record level index be faster than using a bloom based index?
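For context on the trade-off behind this question, here is a minimal sketch of how a bloom-based lookup narrows a record key down to candidate files. It is not Hudi's bloom filter or index code: `TinyBloomFilter`, `candidateFiles`, the hashing scheme, and the file names are all hypothetical. The point it illustrates is that a bloom filter can only answer "possibly present" or "definitely absent", so candidates still have to be verified against the data files, whereas a record-level index would store the key-to-file mapping directly at the cost of one entry per record.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BloomCandidateLookupSketch {

    /** A deliberately tiny Bloom filter for illustration only. */
    static final class TinyBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        TinyBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        void add(String recordKey) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(bitIndex(recordKey, i));
            }
        }

        /** False means definitely absent; true means possibly present (false positives can occur). */
        boolean mightContain(String recordKey) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(bitIndex(recordKey, i))) {
                    return false;
                }
            }
            return true;
        }

        // Double hashing: derive the i-th bit position from two base hashes.
        private int bitIndex(String recordKey, int i) {
            int h1 = recordKey.hashCode();
            int h2 = Integer.rotateLeft(h1, 15) | 1;
            return Math.floorMod(h1 + i * h2, numBits);
        }
    }

    /** Returns the files whose filter says the key may be present; each still needs verification. */
    static List<String> candidateFiles(Map<String, TinyBloomFilter> filtersByFile, String recordKey) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, TinyBloomFilter> entry : filtersByFile.entrySet()) {
            if (entry.getValue().mightContain(recordKey)) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Map<String, TinyBloomFilter> filtersByFile = new LinkedHashMap<>();
        TinyBloomFilter fileA = new TinyBloomFilter(1024, 3);
        fileA.add("key-1");
        TinyBloomFilter fileB = new TinyBloomFilter(1024, 3);
        fileB.add("key-2");
        filtersByFile.put("file-a", fileA);
        filtersByFile.put("file-b", fileB);
        System.out.println("Candidates for key-2: " + candidateFiles(filtersByFile, "key-2"));
    }
}
```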
Thanks @nsivabalan for the initial RFC doc. Had an offline discussion and I am taking over the remaining work in this RFC and addressing the comments. The new PR is at #3989.

@leesf @prashantwason thanks for the review, will address them in #3989






What is the purpose of the pull request
Adding an RFC for metadata based bloom index.