@manojpec manojpec commented Nov 2, 2021

NOTE: New version of this PR at #4352

What is the purpose of the pull request

  • Today, each base file stores its bloom filter in the file footer, so index
    lookups have to open the base file to perform any bloom lookup. Even with
    interval-tree-based file pruning, key lookups still read a significant
    number of base files just to fetch their bloom filters. Index lookups can
    be made faster by storing all the bloom filters in a new metadata
    partition and doing pointed lookups based on keys.

  • This PR adds the basic infrastructure for initializing the new bloom filter
    metadata partition, plus stubs for the write and read code paths.
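
The idea of keeping every file's bloom filter in one keyed store can be sketched as follows. This is a minimal illustration, not the code in this PR: `BloomPartitionSketch`, `ToyBloomFilter`, and the in-memory map standing in for the metadata partition are all hypothetical names, and the two-probe hash is a toy.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Hudi code base.
final class BloomPartitionSketch {

  /** Toy bloom filter: two hash probes over a fixed-size bit set. */
  static final class ToyBloomFilter {
    private static final int BITS = 1024;
    private final BitSet bits = new BitSet(BITS);

    void add(String key) {
      bits.set(probe1(key));
      bits.set(probe2(key));
    }

    boolean mightContain(String key) {
      return bits.get(probe1(key)) && bits.get(probe2(key));
    }

    private static int probe1(String key) {
      return Math.floorMod(key.hashCode(), BITS);
    }

    private static int probe2(String key) {
      return Math.floorMod(key.hashCode() * 31 + 17, BITS);
    }
  }

  // Stand-in for the bloom filter metadata partition: file name -> filter.
  private final Map<String, ToyBloomFilter> filtersByFile = new HashMap<>();

  void register(String fileName, ToyBloomFilter filter) {
    filtersByFile.put(fileName, filter);
  }

  /** Pointed lookup: fetch one file's filter by key, without opening the base file. */
  boolean mightContain(String fileName, String recordKey) {
    ToyBloomFilter filter = filtersByFile.get(fileName);
    // No filter stored for this file: stay conservative and report a possible match.
    return filter == null || filter.mightContain(recordKey);
  }
}
```

A caller would register one filter per base file at write time and call `mightContain(fileName, recordKey)` at lookup time. A `true` result still needs confirmation against the file, since bloom filters can report false positives.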

@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Nov 3, 2021
…lookups

…lookups

- HoodieBackedTableMetadataWriter now prepares records for all its partitions
  using util methods and hands over the map of partition type to list of
  metadata records to the Spark/Flink commit routine.
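
The hand-off described above (a map from partition type to that partition's records) might look roughly like this. It is a sketch under stated assumptions: `PartitionType`, `prepareRecords`, and `commit` are hypothetical stand-ins, and plain strings play the role of metadata records.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and record shapes are illustrative only.
final class PartitionRecordsSketch {

  enum PartitionType { FILES, BLOOM_FILTERS }

  /** Prepares the records for every metadata partition in one pass. */
  static Map<PartitionType, List<String>> prepareRecords(List<String> writtenFiles) {
    Map<PartitionType, List<String>> byPartition = new EnumMap<>(PartitionType.class);
    for (PartitionType type : PartitionType.values()) {
      byPartition.put(type, new ArrayList<>());
    }
    for (String file : writtenFiles) {
      byPartition.get(PartitionType.FILES).add("file-listing:" + file);
      byPartition.get(PartitionType.BLOOM_FILTERS).add("bloom-filter:" + file);
    }
    return byPartition;
  }

  /** Stand-in for an engine-specific commit routine; returns the number of records written. */
  static int commit(Map<PartitionType, List<String>> byPartition) {
    int written = 0;
    for (Map.Entry<PartitionType, List<String>> entry : byPartition.entrySet()) {
      written += entry.getValue().size();
    }
    return written;
  }
}
```

Keeping record preparation separate from the commit step means each engine only has to implement the second method.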
…lookups - read/write path PoC

- PoC for the metadata table based bloom filter index.

- BloomIndexHelper findMatchingFilesForRecordKeys has been modified to look
  at the new metadata table bloom partition for filtering the keys
@manojpec manojpec force-pushed the feature/HUDI-1295-meta-index-bloom-filter-partition-1 branch from 8f19898 to f66b2f9 Compare November 6, 2021 00:53
…lookups - read/write path PoC

- Fixing HoodieBackedTableMetadata::getRecordByKeys() to consider all
  file groups in the partition.

- Using the BloomFilter type dynamically instead of SimpleBloomFilter
@manojpec manojpec force-pushed the feature/HUDI-1295-meta-index-bloom-filter-partition-1 branch from f66b2f9 to bdb2fa6 Compare November 6, 2021 01:04
…lookups - read/write path PoC

 - SparkHoodieBloomIndexHelper now uses HoodieBloomMetaIndexGroupedFunction to fetch
   metadata table bloom indices for various partitions and fileIds in sorted order
   so that each executor does only sequential access.
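
The sorting step can be illustrated in isolation. This sketch assumes a hypothetical `LookupKey` holder and only shows the ordering by partition and then fileId. It does not reproduce HoodieBloomMetaIndexGroupedFunction.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of ordering lookups so that reads proceed sequentially.
final class SortedLookupSketch {

  static final class LookupKey {
    final String partition;
    final String fileId;
    final String recordKey;

    LookupKey(String partition, String fileId, String recordKey) {
      this.partition = partition;
      this.fileId = fileId;
      this.recordKey = recordKey;
    }
  }

  /** Returns a copy ordered by partition, then fileId; the input list is left untouched. */
  static List<LookupKey> sortForSequentialAccess(List<LookupKey> keys) {
    List<LookupKey> sorted = new ArrayList<>(keys);
    sorted.sort(Comparator.comparing((LookupKey k) -> k.partition)
        .thenComparing(k -> k.fileId));
    return sorted;
  }
}
```

With lookups ordered this way, all keys for one file are adjacent, so the filter for that file needs to be fetched only once per run of keys.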
…lookup - write path PoC

- In the metadata table, now we write a new partition for column stats
  to record the column level min, max ranges.

- Metadata table now has 3 partitions in total - files, bloom-filter, col-stats

- The column stats partition's file group, record key and contents still need to be fixed.
…lookup - read path PoC

 - Spark findMatchingFilesForRecordKeys now uses the new HoodieBloomMetaIndexColStatFunction
   to prune files using metadata table based column stats

 - After pruning the file list, it uses HoodieBloomMetaIndexGroupedFunction for the metadata
   table based bloom filter lookup to find the candidate files for the keys and then does the
   file lookup to confirm the keys' presence
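
The three stages above (column-stats pruning, bloom filter check, file confirmation) can be sketched as one loop. Everything here is hypothetical: `FileEntry` bundles per-file min/max keys, a predicate standing in for the bloom filter, and a set standing in for the actual file contents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of the lookup pipeline; not the code in this PR.
final class IndexLookupSketch {

  static final class FileEntry {
    final String fileId;
    final String minKey;              // from column stats
    final String maxKey;              // from column stats
    final Predicate<String> bloom;    // stand-in for the stored bloom filter
    final Set<String> keysInFile;     // stand-in for reading the base file

    FileEntry(String fileId, String minKey, String maxKey,
              Predicate<String> bloom, Set<String> keysInFile) {
      this.fileId = fileId;
      this.minKey = minKey;
      this.maxKey = maxKey;
      this.bloom = bloom;
      this.keysInFile = keysInFile;
    }
  }

  static List<String> findFilesForKey(String key, List<FileEntry> files) {
    List<String> matches = new ArrayList<>();
    for (FileEntry file : files) {
      // Stage 1: column stats. Skip files whose [min, max] range excludes the key.
      if (key.compareTo(file.minKey) < 0 || key.compareTo(file.maxKey) > 0) {
        continue;
      }
      // Stage 2: bloom filter. A negative answer is definitive.
      if (!file.bloom.test(key)) {
        continue;
      }
      // Stage 3: confirm against the file, since the bloom filter may give false positives.
      if (file.keysInFile.contains(key)) {
        matches.add(file.fileId);
      }
    }
    return matches;
  }
}
```

The ordering matters for cost: the two metadata-only checks run first so that the expensive file read happens only for files that survive both.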
…lookup - read path PoC

 - The BloomFilter key and the ColumnStat key now include the PartitionID

 - HoodieBloomIndex::lookupIndex() now uses column stats instead of loading all the
   BloomIndexFileInfo from the individual files for file pruning
…lookup - read path PoC

 - Adding a lazy-loading option for the bloom filter metadata index during index lookups.

 - Added config to enable bloom filter lazy_load/bulk_load for index lookups

 - Added config to enable index lookup verbose logging

 - Added config to enable index lookup timer logging
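
The difference between lazy and bulk loading can be sketched as below. The `LoadMode` enum, the `fetcher` callback, and the fetch counter are all hypothetical. The actual config keys added by this commit are not shown in the description and are not guessed at here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch contrasting lazy and bulk loading of per-file filters.
final class FilterLoadingSketch {

  enum LoadMode { LAZY, BULK }

  private final Map<String, String> cache = new HashMap<>();
  private final Function<String, String> fetcher;
  private int fetchCount;

  FilterLoadingSketch(LoadMode mode, List<String> allFileIds, Function<String, String> fetcher) {
    this.fetcher = fetcher;
    if (mode == LoadMode.BULK) {
      // Bulk mode: pay for every fetch up front.
      for (String fileId : allFileIds) {
        load(fileId);
      }
    }
  }

  /** Returns the filter for a file, fetching it on first use in lazy mode. */
  String filterFor(String fileId) {
    return load(fileId);
  }

  int fetchCount() {
    return fetchCount;
  }

  private String load(String fileId) {
    return cache.computeIfAbsent(fileId, id -> {
      fetchCount++;
      return fetcher.apply(id);
    });
  }
}
```

Lazy mode suits lookups that touch only a few files, while bulk mode avoids repeated round trips when most files will be consulted anyway.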
…tats-index-poc

[HUDI-2705] Metadata Index - Column stats metadata to speed up index lookup - read path PoC
@manojpec manojpec changed the title [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups [WIP][HUDI-1295] Metadata Index - Bloom filter and Column stats metadata to speed up index lookups Nov 8, 2021
…lookup - read path PoC

 - Bug fix in the external spillable map to avoid a divide-by-zero when estimating
   the record size and usage for the first time.

 - S3 path handling for inline fsutils
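
For the divide-by-zero fix, the general shape of such a guard is shown below. This is an assumption about the kind of change involved, not the actual patch: `averageRecordSize` and its fallback parameter are hypothetical.

```java
// Hypothetical sketch of guarding a first-time size estimate.
final class SizeEstimateSketch {

  /**
   * Average bytes per record. Before any record has been counted there is
   * nothing to divide by, so the caller-supplied fallback is returned instead.
   */
  static long averageRecordSize(long totalBytes, long recordCount, long fallbackSize) {
    if (recordCount == 0) {
      return fallbackSize;
    }
    return totalBytes / recordCount;
  }
}
```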
@nsivabalan nsivabalan removed the priority:blocker Production down; release blocker label Nov 15, 2021
@manojpec

This PR is taken over by #4352

@manojpec manojpec closed this Dec 17, 2021