diff --git a/rfc/README.md b/rfc/README.md
index 5ef97300fcc35..a9d794a1101ed 100644
--- a/rfc/README.md
+++ b/rfc/README.md
@@ -59,4 +59,5 @@ The list of all RFCs can be found here.
 | 33 | [Hudi supports more comprehensive Schema Evolution](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution) | `IN PROGRESS` |
 | 34 | [Hudi BigQuery Integration (WIP)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980) | `UNDER REVIEW` |
 | 35 | [Make Flink MOR table writing streaming friendly](https://cwiki.apache.org/confluence/display/HUDI/RFC-35%3A+Make+Flink+MOR+table+writing+streaming+friendly) | `UNDER REVIEW` |
-| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server) | `UNDER REVIEW` |
\ No newline at end of file
+| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server) | `UNDER REVIEW` |
+| 37 | [Hudi Metadata based Bloom Index](rfc-37/rfc-37.md) | `IN PROGRESS` |
\ No newline at end of file
diff --git a/rfc/rfc-37/metadata_index_1.png b/rfc/rfc-37/metadata_index_1.png
new file mode 100644
index 0000000000000..40b834f40f9b8
Binary files /dev/null and b/rfc/rfc-37/metadata_index_1.png differ
diff --git a/rfc/rfc-37/metadata_index_bloom_partition.png b/rfc/rfc-37/metadata_index_bloom_partition.png
new file mode 100644
index 0000000000000..8ada4b7f2c18f
Binary files /dev/null and b/rfc/rfc-37/metadata_index_bloom_partition.png differ
diff --git a/rfc/rfc-37/metadata_index_col_stats.png b/rfc/rfc-37/metadata_index_col_stats.png
new file mode 100644
index 0000000000000..02a77fe0dd6d2
Binary files /dev/null and b/rfc/rfc-37/metadata_index_col_stats.png differ
diff --git a/rfc/rfc-37/rfc-37.md b/rfc/rfc-37/rfc-37.md
new file mode 100644
index 0000000000000..a778ae77a395b
--- /dev/null
+++ b/rfc/rfc-37/rfc-37.md
@@ -0,0 +1,168 @@
+
+# RFC-37: Metadata based Bloom Index
+
+## Proposers
+
+- @nsivabalan
+- @manojpec
+
+## Approvers
+ - @
+ - @
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-2703
+
+## Abstract
+Hudi maintains indices to locate/map incoming records to file groups during writes. The most commonly used record
+index is the HoodieBloomIndex. For larger installations and for global index types, performance can become an issue,
+both because bloom filters have to be loaded from a large number of data files and because some cloud stores throttle
+such bursts of requests. We propose to build a new metadata index (a metadata-table-based bloom index) to boost the
+performance of the existing bloom index.
+
+## Background
+HoodieBloomIndex is used to find the location of incoming records during every write. This assists Hudi in
+deterministically routing records to a given file group and in distinguishing inserts from updates. The bloom index
+relies on the (min, max) values of record keys and the bloom filters in base file footers to find the actual record
+location. In this RFC, we plan to build a new index on top of the metadata table to assist in bloom-index-based tagging.
+
+## Design
+HoodieBloomIndex involves the following steps to find the right location of incoming records (a sketch of this
+pruning pipeline follows the list):
+1. Load all partitions of interest and fetch their data files.
+2. Find and filter the file-to-keys mapping based on the (min, max) record key values in the data file footers.
+3. Further filter the file-to-keys mapping based on the bloom filters in the data file footers.
+4. Look up the actual data files to find the right location of every incoming record.
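+
+Below is a minimal, self-contained Java sketch of this four-step pruning pipeline. It is illustrative only:
+`FileStats`, the `Predicate`-based bloom check and `candidateFiles` are hypothetical stand-ins for the
+footer-resident stats, bloom filters and index internals, not actual Hudi classes.
+
+```java
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Predicate;
+
+class FileStats {
+  String fileId;
+  String minKey, maxKey;          // step 2: (min, max) of _hoodie_record_key from the footer
+  Predicate<String> mightContain; // step 3: bloom filter membership test from the footer
+
+  // Bloom filters can return false positives but never false negatives,
+  // so a file that passes both checks is only a candidate.
+  boolean mayContain(String key) {
+    return minKey.compareTo(key) <= 0 && key.compareTo(maxKey) <= 0 && mightContain.test(key);
+  }
+}
+
+class BloomIndexSketch {
+  // Steps 1-3: map each incoming record key to its candidate files.
+  static Map<String, List<String>> candidateFiles(List<String> keys, List<FileStats> files) {
+    Map<String, List<String>> candidates = new HashMap<>();
+    for (String key : keys) {
+      for (FileStats f : files) {
+        if (f.mayContain(key)) {
+          candidates.computeIfAbsent(key, k -> new ArrayList<>()).add(f.fileId);
+        }
+      }
+    }
+    // Step 4 (not shown): probe the candidate data files to confirm true locations.
+    return candidates;
+  }
+}
+```
+
+The pain point this RFC targets is the construction of the per-file stats: today it requires opening the footer of
+every data file, which is expensive on large tables and can be throttled by cloud stores.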
+
+As we can see from steps 2 and 3, we need the (min, max) values of the `_hoodie_record_key` column and the bloom filter
+of every data file to perform the tagging. In this design, we will add these to the metadata table, and the index lookup
+will consult these metadata table partitions to deduce the file-to-keys mapping.
+
+To realize this, we are adding two new partitions, namely `column_stats` and `bloom_filter`, to the metadata table.
+
+Why the metadata table:
+The metadata table uses HFile to store and retrieve data. HFile is an indexed file format and supports random lookups
+based on keys. Since we will be storing stats/bloom filters for every file and the index will do lookups by file, we
+should be able to benefit from the faster lookups in HFile.
+
+![High Level Metadata Index Design](metadata_index_1.png)
+
+The following sections describe the different partitions and key formats, and then dive into the data and control flows.
+
+### Column_Stats partition:
+`column_stats` will be discussed in depth in RFC-27; in the interest of this RFC, the column_stats partition stores
+statistics (min and max values) of the `_hoodie_record_key` column for all files in the Hudi data table.
+
+The high-level requirement for this column_stats partition is:
+Given a list of record keys, partition paths and file names, find the possibly matching file names based on the
+`_hoodie_record_key` column stats.
+
+To cater to this requirement, we need to ensure that the keys in the HFile are laid out such that we can do point
+lookups for a given data file. The picture below gives a pictorial representation of the column_stats partition in
+the metadata table.
+
+![Column Stats Partition](metadata_index_col_stats.png)
+
+We have to encode column names, file names, etc. into IDs to save storage and to exploit compression. We will update
+the RFC once we have more data on what kind of ID to go with; at a high level, we are weighing incremental IDs against
+hash IDs.
+
+For now, let's assume that every entity (column name, partition path name, file name) will be given an ID.
+
+```
+Key in column_stats partition =
+[ColumnId][PartitionId][FileId]
+```
+```
+Value: stats {
+    min_value: bytes
+    max_value: bytes
+    ...
+    ...
+}
+```
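+
+As a concrete illustration of why such a key layout supports HFile point lookups, here is a hedged sketch. It assumes,
+purely for illustration, 4-byte incremental IDs rendered as fixed-width hex; the RFC deliberately leaves the actual ID
+scheme (incremental vs hash) open, and `ColumnStatsKey` is a hypothetical helper, not a Hudi class.
+
+```java
+class ColumnStatsKey {
+  // [ColumnId][PartitionId][FileId], each ID zero-padded to 8 hex chars.
+  // Fixed-width hex keeps lexicographic key order aligned with numeric ID
+  // order for non-negative IDs, so an exact key supports a point lookup and
+  // a [ColumnId][PartitionId] prefix covers all files of one partition.
+  static String encode(int columnId, int partitionId, int fileId) {
+    return String.format("%08x%08x%08x", columnId, partitionId, fileId);
+  }
+
+  public static void main(String[] args) {
+    // e.g. column 1, partition 42, file 1007 -> "000000010000002a000003ef"
+    System.out.println(encode(1, 42, 1007));
+  }
+}
+```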
+
+### Bloom Filter Partition:
+This partition stores the bloom filters of all base files in the data table and will be leveraged by the metadata
+index being designed in this RFC.
+
+![Bloom Filter Partition](metadata_index_bloom_partition.png)
+
+Requirements:
+Given a list of FileIDs, return their bloom filters.
+```
+Key format: [PartitionId][FileId]
+```
+```
+Value:
+{
+    serialized bloom
+    bloom type code
+}
+```
+
+## Implementation
+
+### Writer flow:
+Let's walk through the writer flow to update these partitions.
+
+Whenever a new commit is applied to the metadata table, we do the following (a sketch of the amortized footer read
+follows the list):
+1. Files partition - prepare records for adding.
+2. Column_stats partition - prepare records for adding:
+   `[ColumnId][PartitionId][FileId] => ColumnStats`
+   This involves reading the base file footers to fetch the min and max values for each column.
+3. Bloom_filter partition - prepare records for adding:
+   `[PartitionId][FileId] => BloomFilter`
+   This also involves reading the base file footers. We can amortize the cost across (2) and (3): read each footer
+   once and prepare/populate records for both partitions.
+4. Commit all these records to the metadata table.
+
+We need to ensure that all the info the metadata writer needs is passed along in the WriteStatus for every commit.
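+
+Below is a hedged sketch of the amortization across steps (2) and (3): each base file footer is read once, and records
+for both partitions are emitted from that single read. `Footer`, `ColumnRange` and the two record types are
+illustrative stand-ins rather than Hudi classes, and the key encoding reuses the hypothetical fixed-width-hex scheme
+sketched earlier.
+
+```java
+import java.util.List;
+import java.util.Map;
+
+class MetadataRecordPrep {
+  static class ColumnRange { byte[] min; byte[] max; }
+  static class Footer {                    // stand-in for one base file footer read
+    Map<String, ColumnRange> columnRanges; // per-column (min, max)
+    byte[] serializedBloom;
+    int bloomTypeCode;
+  }
+  record ColumnStatsRecord(String key, byte[] min, byte[] max) {}
+  record BloomFilterRecord(String key, byte[] bloom, int typeCode) {}
+
+  static void prepare(int partitionId, int fileId, Footer footer, Map<String, Integer> columnIds,
+                      List<ColumnStatsRecord> statsOut, List<BloomFilterRecord> bloomOut) {
+    // Step 2: one column_stats record per column of this file.
+    for (Map.Entry<String, ColumnRange> e : footer.columnRanges.entrySet()) {
+      String key = String.format("%08x%08x%08x", columnIds.get(e.getKey()), partitionId, fileId);
+      statsOut.add(new ColumnStatsRecord(key, e.getValue().min, e.getValue().max));
+    }
+    // Step 3: one bloom_filter record for this file, from the same footer read.
+    String bloomKey = String.format("%08x%08x", partitionId, fileId);
+    bloomOut.add(new BloomFilterRecord(bloomKey, footer.serializedBloom, footer.bloomTypeCode));
+  }
+}
+```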
+
+### Reader flow:
+
+Despite the name, this is part of the write path: when a new batch of writes is ingested into Hudi, we need to tag the
+records with their original file group locations, and this index will leverage both partitions to deduce the
+record key => file name mappings.
+
+```
+Input: JavaRdd
+Output: JavaPairRdd
+```
+
+We will re-use some of the source code from the existing bloom index implementation and redirect the min/max value
+filtering and the bloom-based filtering to the metadata table.
+
+The actual steps are as follows (a sketch of steps 3 and 4 follows the list):
+1. Find all partitions of interest.
+2. Fetch all files pertaining to the partitions of interest.
+3. Look up the column_stats partition in the metadata table and find the list of possible HoodieKeys against every file.
+4. Look up the bloom_filter partition in the metadata table and find the list of possible HoodieKeys against every file.
+5. Look up the actual data files to deduce the actual record location for every HoodieKey.
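+
+Below is a hedged sketch of steps 3 and 4, with the footer reads of the existing bloom index replaced by lookups
+against the metadata table partitions. The two `Function`-typed lookups are illustrative stand-ins for HFile point
+lookups into `column_stats` and `bloom_filter`; none of these types are Hudi classes.
+
+```java
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Function;
+import java.util.function.Predicate;
+
+class MetadataIndexLookup {
+  static class KeyRange { String min; String max; }
+
+  static Map<String, List<String>> tagCandidates(
+      List<String> recordKeys, List<String> fileIds,
+      Function<String, KeyRange> columnStatsLookup,      // step 3: point lookup per file
+      Function<String, Predicate<String>> bloomLookup) { // step 4: point lookup per file
+    Map<String, List<String>> candidates = new HashMap<>();
+    for (String fileId : fileIds) {
+      KeyRange range = columnStatsLookup.apply(fileId);
+      Predicate<String> bloom = null; // fetch lazily, only if range pruning lets a key through
+      for (String key : recordKeys) {
+        if (range.min.compareTo(key) > 0 || key.compareTo(range.max) > 0) {
+          continue; // pruned by the (min, max) values from column_stats
+        }
+        if (bloom == null) {
+          bloom = bloomLookup.apply(fileId);
+        }
+        if (bloom.test(key)) {
+          candidates.computeIfAbsent(key, k -> new ArrayList<>()).add(fileId);
+        }
+      }
+    }
+    // Step 5 (not shown): probe the candidate data files for the true record locations.
+    return candidates;
+  }
+}
+```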
+
+## Rollout/Adoption Plan
+
+ - To be filled.
+ - What impact (if any) will there be on existing users?
+ - If we are changing behavior, how will we phase out the older behavior?
+ - If we need special migration tools, describe them here.
+ - When will we remove the existing behavior?
+
+## Test Plan
+
+Describe in a few sentences how the RFC will be tested. How will we know that the implementation works as expected?
+How will we know nothing broke?
+
+To be filled.
\ No newline at end of file