[HUDI-2589] RFC-37: Metadata table based bloom index #3989
vinothchandar
left a comment
So the process calls for first landing the change here so you get your number fixed.
http://hudi.apache.org/contribute/rfc-process#proposing-the-rfc
Same for @nsivabalan's other RFC. Let's please read this carefully, follow it verbatim and proceed. :)
@vinothchandar sure, here is the PR claiming the RFC number: #3995
nsivabalan
left a comment
Left some feedback.
CI passed on the re-run of the failed job.
2. For all the involved partitions, load all its file list
3. Level 1: Range pruning using `column_stats` index:
   1. For each of the record key, generate the column stats index lookup key based on the tuple
      (__hoodie_record_key, partition name, file path)
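The Level 1 step quoted above could be sketched roughly as follows. This is an illustration only, not the RFC's implementation: the key encoding (truncated MD5 of each tuple component), the `ColumnRange` type, and the function names are all assumptions made for the example, and `column_stats` is modeled as a plain in-memory dict rather than a metadata table partition.

```python
import hashlib
from typing import Dict, List, NamedTuple

RECORD_KEY_COLUMN = "_hoodie_record_key"


class ColumnRange(NamedTuple):
    """Min and max value of a column within one base file (hypothetical type)."""
    min_key: str
    max_key: str


def stats_lookup_key(column: str, partition: str, file_name: str) -> str:
    """Build a lookup key from the (column, partition, file) tuple.

    Assumed encoding: concatenate a 16-hex-character hash of each component.
    """
    parts = (column, partition, file_name)
    return "".join(hashlib.md5(p.encode("utf-8")).hexdigest()[:16] for p in parts)


def candidate_files(
    record_key: str,
    partition: str,
    file_names: List[str],
    column_stats: Dict[str, ColumnRange],
) -> List[str]:
    """Keep only the files whose record-key range may contain record_key."""
    candidates = []
    for file_name in file_names:
        key = stats_lookup_key(RECORD_KEY_COLUMN, partition, file_name)
        stats = column_stats.get(key)
        # A file without stats cannot be ruled out, so it stays a candidate.
        if stats is None or stats.min_key <= record_key <= stats.max_key:
            candidates.append(file_name)
    return candidates
```

For example, with stats recorded for `f1.parquet` (`key_000` to `key_099`) and `f2.parquet` (`key_100` to `key_199`) but none for `f3.parquet`, looking up `key_150` leaves `f2.parquet` and `f3.parquet` as candidates for the next level.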
We need to keep in mind that the Column Stats index will likely not be very efficient when we don't maintain the invariant that rows are ordered by that column (in this case `_hoodie_record_key`). In an ideal scenario, base files should not overlap in their key ranges; the more overlap there is between them, the less efficient the index becomes.
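A toy illustration of the overlap concern, using made-up file names and single-letter key ranges (none of this comes from the RFC): when the per-file ranges are disjoint, min/max pruning narrows a key down to one file; when every file spans nearly the whole key space, nothing is pruned.

```python
from typing import Dict, List, Tuple


def files_to_check(record_key: str, file_ranges: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return the files whose [min, max] record-key range contains record_key."""
    return [
        name
        for name, (min_key, max_key) in file_ranges.items()
        if min_key <= record_key <= max_key
    ]


# Keys written in sorted order: file ranges are disjoint.
sorted_layout = {"f1": ("a", "f"), "f2": ("g", "m"), "f3": ("n", "z")}

# Keys written in arbitrary order: every file covers most of the key space.
overlapping_layout = {"f1": ("a", "x"), "f2": ("b", "y"), "f3": ("c", "z")}

print(files_to_check("k", sorted_layout))       # ['f2']
print(files_to_check("k", overlapping_layout))  # ['f1', 'f2', 'f3']
```

In the second layout every file still has to go through the bloom filter check, so the range-pruning level adds lookup cost without reducing the candidate set.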
- Adding the RFC-37 doc for Metadata table based bloom index
- Adding a section on Schema evolution and its impact on the hash id based index keys.
- addressing review comments
- Updating the schema for index payloads and addressing other review comments.
- Updated the metadata schema used for the index
Co-authored-by: Sivabalan Narayanan <[email protected]>
What is the purpose of the pull request
Adding the RFC for metadata table based bloom index.
Brief change log
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.