[WIP][HUDI-1295] Metadata Index - Bloom filter and Column stats metadata to speed up index lookups #3904

manojpec · 2021-11-02T08:41:23Z

NOTE: New version of this PR at #4352

What is the purpose of the pull request

Today, base files store their bloom filter in the file footer, so an index
lookup has to open the base file before it can run any bloom check. Interval
tree based file pruning reduces the candidate set, but key lookups still end
up reading a significant number of base files just to reach their bloom
filters. Index lookups can be made faster by storing all the bloom filters in
a new metadata table partition and doing point lookups by key.

This PR adds the basic infrastructure for initializing the new bloom filter
metadata partition, plus stubs for the write and read code paths.
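To make the idea concrete, here is a minimal, language-neutral sketch in Python (Hudi itself is written in Java). It assumes a toy `BloomFilter` class and an in-memory dict named `bloom_partition` standing in for the new metadata partition. Both names, and the `register_file` / `candidate_files` helpers, are hypothetical and are not Hudi APIs.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter for illustration only; not Hudi's implementation."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Hypothetical stand-in for the bloom filter metadata partition: one filter
# per (partition path, file ID), fetched by key rather than from a file footer.
bloom_partition = {}


def register_file(partition_path, file_id, record_keys):
    """Build the filter for one base file and store it under its lookup key."""
    bloom = BloomFilter()
    for record_key in record_keys:
        bloom.add(record_key)
    bloom_partition[(partition_path, file_id)] = bloom


def candidate_files(partition_path, file_ids, record_key):
    """Return the files whose filter says record_key may be present."""
    return [
        file_id
        for file_id in file_ids
        if bloom_partition[(partition_path, file_id)].might_contain(record_key)
    ]
```

In this sketch no base file is opened during the lookup; only the filters stored under `(partition_path, file_id)` are consulted. A returned file is still only a candidate, because Bloom filters can report false positives.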

…lookups - HoodieBackedTableMetadataWriter now prepares records for all its partitions using util methods and hands over the map of partition type to list of metadata records to the Spark/Flink commit routine.

…lookups - read/write path PoC - PoC for the metadata table based bloom filter index. - BloomIndexHelper findMatchingFilesForRecordKeys has been modified to look at the new metadata table bloom partition for filtering the keys

…lookups - read/write path PoC - Fixing HoodieBackedTableMetadata::getRecordByKeys() to consider all file groups in the partition. - Using the BloomFilter type dynamically instead of SimpleBloomFilter

…lookups - read/write path PoC - SparkHoodieBloomIndexHelper now uses HoodieBloomMetaIndexGroupedFunction to fetch metadata table bloom indices for the various partitions and file IDs in sorted order, so that each executor does only sequential access.
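The sorted, grouped access pattern described in this commit can be sketched roughly as follows. This is an illustrative Python sketch, not the code in `HoodieBloomMetaIndexGroupedFunction`; the `group_lookups` name and the tuple layout `(partition_path, file_id, record_key)` are assumptions.

```python
from itertools import groupby


def group_lookups(lookups):
    """Sort (partition_path, file_id, record_key) tuples and group them by file.

    Sorting first means all keys for one (partition, file) pair are adjacent,
    so each pair's bloom filter needs to be fetched only once, in key order.
    """
    ordered = sorted(lookups)
    grouped = []
    for file_key, items in groupby(ordered, key=lambda t: (t[0], t[1])):
        grouped.append((file_key, [record_key for _, _, record_key in items]))
    return grouped
```

For example, lookups against `("p1", "f2")` that arrive interleaved with other files end up in a single group, in sorted key order.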

…lookup - write path PoC - In the metadata table, we now write a new partition for column stats to record the column level min and max ranges. - The metadata table now has 3 partitions in total: files, bloom-filter, col-stats - The column stats partition file group, record key and contents still need to be fixed.

…lookup - read path PoC - Spark findMatchingFilesForRecordKeys now uses the new HoodieBloomMetaIndexColStatFunction to prune files using metadata table based column stats - After pruning the file list, it uses HoodieBloomMetaIndexGroupedFunction for the metadata table based bloom filter lookup to find the candidate files for the keys and then does the file lookup to confirm the keys' presence
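The three stages in this commit (range pruning, bloom filter check, file lookup) can be sketched as below. This is a simplified Python illustration under stated assumptions: `column_ranges` maps a file ID to the `(min, max)` of its record keys, and `bloom_might_contain` and `file_contains_key` are caller-supplied callables. None of these names come from Hudi.

```python
def lookup_files_for_key(record_key, column_ranges, bloom_might_contain, file_contains_key):
    """Return the file IDs that really contain record_key.

    column_ranges: dict of file_id -> (min_key, max_key), as column stats would give.
    bloom_might_contain(file_id, key): may return false positives, never false negatives.
    file_contains_key(file_id, key): the authoritative (and expensive) check.
    """
    # Stage 1: keep only files whose [min, max] key range covers the key.
    in_range = [
        file_id
        for file_id, (min_key, max_key) in column_ranges.items()
        if min_key <= record_key <= max_key
    ]
    # Stage 2: drop files whose bloom filter says the key is definitely absent.
    maybe_present = [f for f in in_range if bloom_might_contain(f, record_key)]
    # Stage 3: confirm against the file itself to eliminate false positives.
    return [f for f in maybe_present if file_contains_key(f, record_key)]
```

The ordering matters for cost: the cheapest filter runs first, so the expensive file read in stage 3 is attempted only for files that survive both metadata checks.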

…lookup - read path PoC - BloomFilter key and the ColumnStat key now includes the PartitionID - HoodieBloomIndex::lookupIndex() now uses column stats instead of loading all the BloomIndexFileInfo from the individual files for file pruning

…lookup - read path PoC - Adding lazy bloom filter metadata index loading feature when looking up index. - Added config to enable bloom filter lazy_load/bulk_load for index lookups - Added config to enable index lookup verbose logging - Added config to enable index lookup timer logging
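The difference between lazy and bulk loading of bloom filters can be illustrated with a small Python sketch. The `BloomFilterLoader` class, its `lazy` flag and the `fetch_filter` callable are hypothetical; the actual config names added by this commit are not shown in this PR text.

```python
class BloomFilterLoader:
    """Loads per-file bloom filters either up front (bulk) or on first use (lazy)."""

    def __init__(self, fetch_filter, file_ids, lazy=True):
        # fetch_filter(file_id) is assumed to read one filter from the metadata table.
        self._fetch_filter = fetch_filter
        self._cache = {}
        if not lazy:
            # Bulk mode: pay the full loading cost once, before any lookup.
            for file_id in file_ids:
                self._cache[file_id] = fetch_filter(file_id)

    def get(self, file_id):
        # Lazy mode: fetch on first access, then serve from the cache.
        if file_id not in self._cache:
            self._cache[file_id] = self._fetch_filter(file_id)
        return self._cache[file_id]
```

Lazy loading tends to help when only a few files are probed after pruning, while bulk loading avoids many small reads when most filters will be needed anyway.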

…tats-index-poc [HUDI-2705] Metadata Index - Column stats metadata to speed up index lookup - read path PoC

…lookup - read path PoC - Bug fix in the external spillable map to avoid a divide-by-zero when estimating the record size and usage for the first time. - S3 path handling for inline fsutils

…tats-index-poc [HUDI-2705] Metadata Index - Column stats metadata to speed up index lookup - read path PoC

hudi-bot · 2021-11-09T18:55:14Z

CI report:

9cba53c Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

manojpec · 2021-12-17T09:06:56Z

This PR is taken over by #4352

nsivabalan added the priority:blocker (Production down; release blocker) label Nov 3, 2021

manojpec added 3 commits November 5, 2021 17:51

[HUDI-1295] Metadata Index - Bloom filter metadata to speed up index …

696d522

…lookups - HoodieBackedTableMetadataWriter now prepares records for all its partitions using util methods and hands over the map of partition type to list of metadata records to the Spark/Flink commit routine.

manojpec force-pushed the feature/HUDI-1295-meta-index-bloom-filter-partition-1 branch from 8f19898 to f66b2f9 Compare November 6, 2021 00:53

[HUDI-2700] Metadata Index - Bloom filter metadata to speed up index …

bdb2fa6

…lookups - read/write path PoC - Fixing HoodieBackedTableMetadata::getRecordByKeys() to consider all file groups in the partition. - Using the BloomFilter type dynamically instead of SimpleBloomFilter

manojpec force-pushed the feature/HUDI-1295-meta-index-bloom-filter-partition-1 branch from f66b2f9 to bdb2fa6 Compare November 6, 2021 01:04

manojpec added 6 commits November 6, 2021 02:11

Merge pull request #2 from manojpec/feature/HUDI-2705-meta-index-cols…

948d4e4

…tats-index-poc [HUDI-2705] Metadata Index - Column stats metadata to speed up index lookup - read path PoC

manojpec changed the title ~~[WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups~~ [WIP][HUDI-1295] Metadata Index - Bloom filter and Column stats metadata to speed up index lookups Nov 8, 2021

manojpec added 2 commits November 9, 2021 09:46

[HUDI-2705] Metadata Index - Column stats metadata to speed up index …

efc8909

…lookup - read path PoC - Bug fix in the external spillable map to avoid a divide-by-zero when estimating the record size and usage for the first time. - S3 path handling for inline fsutils

Merge pull request #3 from manojpec/feature/HUDI-2705-meta-index-cols…

9cba53c

…tats-index-poc [HUDI-2705] Metadata Index - Column stats metadata to speed up index lookup - read path PoC

nsivabalan removed the priority:blocker (Production down; release blocker) label Nov 15, 2021

vinothchandar added the big-needle-movers label Dec 14, 2021

manojpec closed this Dec 17, 2021

4 participants
