[WIP][HUDI-1295] Metadata Index - Bloom filter and Column stats metadata to speed up index lookups #3904
Closed
manojpec wants to merge 12 commits into apache:master from manojpec:feature/HUDI-1295-meta-index-bloom-filter-partition-1
…lookups - Today, base files store bloom filters in their footers, so index lookups have to load the base file to perform any bloom lookup. Even with interval-tree-based file pruning, we still end up reading a significant number of base files just for their bloom filters during key index lookups. This lookup can be made more performant by keeping all the bloom filters in a new metadata partition and doing point lookups based on keys. - This PR adds the basic infrastructure for initializing the new bloom filter metadata partition, plus stubs for the write and read code paths.
…lookups - HoodieBackedTableMetadataWriter now prepares records for all its partitions using util methods and hands a map from partition type to list of metadata records over to the Spark/Flink commit routine.
…lookups - read/write path PoC - PoC for the metadata table based bloom filter index. - BloomIndexHelper findMatchingFilesForRecordKeys has been modified to look at the new metadata table bloom partition for filtering the keys
8f19898 to f66b2f9
…lookups - read/write path PoC - Fixing HoodieBackedTableMetadata::getRecordByKeys() to consider all file groups in the partition. - Using the BloomFilter type dynamically instead of SimpleBloomFilter.
f66b2f9 to bdb2fa6
…lookups - read/write path PoC - SparkHoodieBloomIndexHelper now uses HoodieBloomMetaIndexGroupedFunction to fetch metadata table bloom indices for the various partitions and file IDs in sorted order, so that each executor performs only sequential access.
…lookup - write path PoC - The metadata table now gets a new partition for column stats, which records the column-level min and max ranges. - The metadata table now has 3 partitions in total: files, bloom-filter, col-stats. - The column stats partition file group, record key and contents still need to be fixed.
…lookup - read path PoC - Spark findMatchingFilesForRecordKeys now uses the new HoodieBloomMetaIndexColStatFunction to prune files using metadata table based column stats. - After pruning the file list, it uses HoodieBloomMetaIndexGroupedFunction for the metadata table based bloom filter lookup to find the candidate files for the keys, and then does the file lookup to confirm the keys' presence.
…lookup - read path PoC - The BloomFilter key and the ColumnStat key now include the PartitionID. - HoodieBloomIndex::lookupIndex() now uses column stats for file pruning instead of loading all the BloomIndexFileInfo from the individual files.
…lookup - read path PoC - Adding a lazy-loading option for the bloom filter metadata index during index lookups. - Added config to enable bloom filter lazy_load/bulk_load for index lookups. - Added config to enable index lookup verbose logging. - Added config to enable index lookup timer logging.
…tats-index-poc [HUDI-2705] Metadata Index - Column stats metadata to speed up index lookup - read path PoC
…lookup - read path PoC - Bug fix in the external spillable map to avoid a divide-by-zero when estimating the record size and usage for the first time. - S3 path handling for inline fsutils.
…tats-index-poc [HUDI-2705] Metadata Index - Column stats metadata to speed up index lookup - read path PoC
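The read path described in the commits above (prune files by column stats, then consult the metadata-table bloom filters, then read the file to confirm) can be sketched roughly as below. This is only an illustration of the three-stage idea, not the code in this PR: `IndexLookupPipelineSketch`, `FileKeyRange`, and the two predicates are hypothetical names, and the bloom filter and file reads are stubbed out as caller-supplied functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch; none of these names are Hudi classes.
class IndexLookupPipelineSketch {

    /** Min/max record key range for one base file, as a column-stats record might hold it. */
    static final class FileKeyRange {
        final String fileId;
        final String minKey;
        final String maxKey;

        FileKeyRange(String fileId, String minKey, String maxKey) {
            this.fileId = fileId;
            this.minKey = minKey;
            this.maxKey = maxKey;
        }

        /** True if the key falls inside [minKey, maxKey] in lexicographic order. */
        boolean covers(String key) {
            return minKey.compareTo(key) <= 0 && key.compareTo(maxKey) <= 0;
        }
    }

    /**
     * Returns the ids of files that really contain recordKey.
     * Arguments to both predicates are (fileId, recordKey).
     */
    static List<String> lookup(List<FileKeyRange> columnStats,
                               String recordKey,
                               BiPredicate<String, String> bloomMightContain,
                               BiPredicate<String, String> fileContainsKey) {
        List<String> matches = new ArrayList<>();
        for (FileKeyRange range : columnStats) {
            // Stage 1: cheap range check from column stats.
            if (!range.covers(recordKey)) {
                continue;
            }
            // Stage 2: bloom filter check (may give false positives, never false negatives).
            if (!bloomMightContain.test(range.fileId, recordKey)) {
                continue;
            }
            // Stage 3: only now pay for an actual file read to confirm.
            if (fileContainsKey.test(range.fileId, recordKey)) {
                matches.add(range.fileId);
            }
        }
        return matches;
    }
}
```

The ordering is the point of the sketch: each stage is cheaper than the next, so the expensive base file read happens only for files that survive both metadata checks.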
This PR is taken over by #4352.
NOTE: New version of this PR at #4352
What is the purpose of the pull request
Today, base files store bloom filters in their footers, so index lookups
have to load the base file to perform any bloom lookup. Even with
interval-tree-based file pruning, we still end up reading a significant
number of base files just for their bloom filters during key index
lookups. This lookup can be made more performant by keeping all the
bloom filters in a new metadata partition and doing point lookups
based on keys.
This PR adds the basic infrastructure for initializing the new bloom filter
metadata partition, plus stubs for the write and read code paths.
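To make the idea concrete, here is a minimal sketch of a bloom filter index that is looked up by key instead of being read from each base file's footer. Everything in it is an assumption for illustration: `BloomMetaIndexSketch`, `TinyBloomFilter`, and the `partitionPath|fileId` key format are hypothetical and are not the classes or key layout this PR introduces, and an in-memory map stands in for the metadata partition.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Hudi's API.
class BloomMetaIndexSketch {

    /** Very small bloom filter using double hashing; stands in for a per-file filter. */
    static final class TinyBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        TinyBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(slot(key, i));
            }
        }

        /** False means definitely absent; true means possibly present. */
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(slot(key, i))) {
                    return false;
                }
            }
            return true;
        }

        private int slot(String key, int i) {
            return Math.floorMod(key.hashCode() + i * secondHash(key), numBits);
        }

        // FNV-1a over the UTF-8 bytes, used as the second hash.
        private static int secondHash(String key) {
            int h = 0x811c9dc5;
            for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
                h ^= (b & 0xff);
                h *= 0x01000193;
            }
            return h;
        }
    }

    // Stand-in for the metadata partition: one filter per (partition path, file id).
    private final Map<String, TinyBloomFilter> filtersByIndexKey = new HashMap<>();

    static String indexKey(String partitionPath, String fileId) {
        return partitionPath + "|" + fileId;
    }

    void put(String partitionPath, String fileId, TinyBloomFilter filter) {
        filtersByIndexKey.put(indexKey(partitionPath, fileId), filter);
    }

    /** Returns files that may hold recordKey, using point lookups and no base file reads. */
    List<String> candidateFiles(String partitionPath, List<String> fileIds, String recordKey) {
        List<String> candidates = new ArrayList<>();
        for (String fileId : fileIds) {
            TinyBloomFilter filter = filtersByIndexKey.get(indexKey(partitionPath, fileId));
            // No filter stored: stay conservative and keep the file as a candidate.
            if (filter == null || filter.mightContain(recordKey)) {
                candidates.add(fileId);
            }
        }
        return candidates;
    }
}
```

A file with no stored filter is kept as a candidate here, on the assumption that the caller can still fall back to the footer-based check for it.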
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.