
Conversation

@n3nash (Contributor) commented May 6, 2019

Performed the following tests to check for correctness & performance:

Correctness
  numEntries: 1000, ErrorRate: 0.0000001, NumKeysAdded: 10000 & 100000 (10x and 100x), Iterations: 100
  BloomIndex: 10+ false positives on average (best case = 10, worst case = 1000)
  DynamicBloomIndex: 0 false positives in best, average & worst case

Speed
  numEntries: 500000, ErrorRate: 0.0000001, NumKeysAdded: 500000, Iterations: 100
  BloomIndex: 0.3 secs in average, best & worst case
  DynamicBloomIndex: best case = 0.3 secs, worst case = 0.7 secs

Size
  numEntries: 500000, ErrorRate: 0.0000001, NumKeysAdded: 500000, Iterations: 100
  BloomIndex: 2 MB in average, best & worst case
  DynamicBloomIndex: 2 MB in average, best & worst case
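For reference, the ~2 MB size figure is consistent with the standard Bloom filter sizing formula (bits m = -n ln p / (ln 2)^2). A quick stdlib-only check, using the numEntries/ErrorRate from the size test above; this is generic Bloom math, not Hudi code:

```java
// Standard Bloom filter sizing: bits m = ceil(-n ln p / (ln 2)^2),
// hash functions k = round((m / n) ln 2). Generic math, not Hudi code.
class BloomSizing {
    static long bits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    static int hashes(long n, double p) {
        return (int) Math.round((double) bits(n, p) / n * Math.log(2));
    }

    public static void main(String[] args) {
        long n = 500_000;        // numEntries from the size test above
        double p = 0.0000001;    // ErrorRate from the size test above
        long m = bits(n, p);
        // ~16.8M bits, i.e. roughly 2.1 MB, matching the ~2 MB measured above.
        System.out.printf("bits=%d bytes=%d hashes=%d%n", m, m / 8, hashes(n, p));
    }
}
```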

@n3nash n3nash force-pushed the dynamic_bloom_filter branch 4 times, most recently from e1e0a61 to cad3b2f Compare May 6, 2019 04:34
…m filter for static sizing

- Expose Index Config to make dynamic bloom filter configurable & keep backwards compatibility
@n3nash n3nash force-pushed the dynamic_bloom_filter branch from cad3b2f to 760f513 Compare May 6, 2019 04:55
@n3nash n3nash requested a review from bvaradar May 6, 2019 05:49
@vinothchandar (Member) left a comment

Results seem great. A few questions:

  • On the tests where DynamicBloomFilter achieves a better fpp/error rate, did you notice an increase in size on disk?
  • Also, what kind of difference did you notice in reading time for a 2MB filter vs a 200KB filter? (I am thinking in parallel about bumping the default bloom filter entries up to 500K.)

public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL =
"hoodie.bloom.index.input.storage" + ".level";
public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
public static final String BLOOM_INDEX_ENABLE_DYNAMIC_PROP =
Member:

Rename to "hoodie.bloom.index.dynamic.bloomfilter" or "hoodie.bloom.index.auto.tune.bloomfilter"? To make it clearer it's about the bloom filter and not the index checking.

Also, this should probably belong in the storage config, since it's about how we write the parquet file, right?

Contributor Author:

I suggest we leave it in the IndexConfig, since logically that's where people would look for all Index/BloomFilter related settings?

Member:

Depends on how you look at it. At the code level, it's weird to suddenly access an index config in storage. We can leave it here for now, but let's rename?

    HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable) throws IOException {
    BloomFilter filter = new BloomFilter(config.getBloomFilterNumEntries(),
-       config.getBloomFilterFPP());
+       config.getBloomFilterFPP(), false);
Member:

Shouldn't we pass the config here as well?

Contributor Author:

good catch

Member:

@n3nash without the config here, actually we would not have written dynamic filters at all during the tests?

Contributor Author:

I ran tests isolated to just the bloom filter and generated UUIDs + random strings (to simulate non-UUID-based keys); the tests don't go through this code path.


public BloomFilter(int numEntries, double errorRate) {
this(numEntries, errorRate, Hash.MURMUR_HASH);
private org.apache.hadoop.util.bloom.DynamicBloomFilter dynamicBloomFilter = null;
Member:

Would be curious to see which other projects use this, if it's easy to find out.

Contributor Author:

HBase -> https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/BloomFilter.html
Looked through the HBase code; they have implemented their own bloom filters based on the algorithms in the above class.
Cassandra -> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsTuningBloomFilters.html
They don't support dynamic bloom filters; instead they force rewriting the bloom filter.


String footerVal = readParquetFooter(configuration, parquetFilePath,
HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY).get(0);
return new BloomFilter(footerVal);
return new BloomFilter(footerVal, enableDynamicBloomFilter);
Member:

I am assuming this is needed because only then would we know the type of bloom filter and deserialize it properly? Can BloomFilter read a serialized DynamicBloomFilter value? What happens when we have a mix of files written using normal and dynamic filters? Should we resolve this using an additional footer, instead of making this a writer-side config?

Contributor Author:

Have some thoughts around this, will discuss f2f
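The "additional footer" idea raised above could look like the following sketch: prefix the serialized filter with a one-byte type code and dispatch on it at read time, so the reader does not depend on a write-time config. All names here are illustrative, not Hudi's actual API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: tag the serialized filter bytes with a type code so a
// reader can tell normal vs dynamic filters apart per file. Illustrative
// names only, not Hudi's actual API.
class TaggedFilterCodec {
    static final byte TYPE_SIMPLE = 0;
    static final byte TYPE_DYNAMIC = 1;

    static byte[] serialize(byte type, byte[] filterBytes) {
        byte[] out = new byte[filterBytes.length + 1];
        out[0] = type; // one-byte "footer" tag up front
        System.arraycopy(filterBytes, 0, out, 1, filterBytes.length);
        return out;
    }

    static byte type(byte[] serialized) {
        return serialized[0];
    }

    static byte[] payload(byte[] serialized) {
        return Arrays.copyOfRange(serialized, 1, serialized.length);
    }

    public static void main(String[] args) {
        byte[] fake = "fake-filter-bits".getBytes(StandardCharsets.UTF_8);
        byte[] stored = serialize(TYPE_DYNAMIC, fake);
        // Reader dispatches on the tag instead of trusting a writer config.
        System.out.println("type=" + type(stored)
            + " payloadLen=" + payload(stored).length);
    }
}
```

A mix of old and new files is then handled per file: untagged or TYPE_SIMPLE payloads go through the existing deserializer, TYPE_DYNAMIC through the dynamic one.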

import java.io.IOException;
import java.util.UUID;

public abstract class AbstractBloomFilter {
Member:

rename : AbstractBloomFilterTest

Contributor Author:

done

@vinothchandar vinothchandar self-assigned this May 6, 2019
@n3nash (Contributor Author) commented May 7, 2019

I did not see any significant increase in size on disk. If you see my comment above on performance results, there is one on size and both are close to 2 MB; I rounded them to the nearest megabyte, so there may be differences in the kilobytes.
I did not capture reading times (disk -> memory); I'll add some results for that too. But I was also thinking of bumping the default entries higher, to 500K-ish.
General comment:
Option 1: Should we just use DynamicBloom by default? What we can do is add a bit in the footer to say whether the new bloom filter is dynamic or not. When we deserialize, if we find this bit, we treat it as dynamic, else as a regular bloom. Whenever a file is rewritten (through merge handle), convert the regular bloom to dynamic.
Option 2: Start with a large default for the bloom filter (assuming reads are not affected much). If the file is ever rewritten, re-evaluate the number of entries in the bloom filter based on the number of entries in the parquet file, and reset it.

@vinothchandar (Member) commented May 7, 2019

> there is one on size and both of them are close to 2MB, I actually rounded them off to the near megabyte, there may be differences in kilobytes.

Can we test with N=500000, fp=0.000000001, and 10x/100x that? I think that will produce larger sizes/more fps. I would be surprised if dynamic provides many fewer fps with the same number of bits; all it must be doing is using more bits as more entries come in. You can use something like https://krisives.github.io/bloom-calculator/ to design a case around this.

If proven to work, yes, we should enable DynamicBloom by default. I think we have to do option 1, right? In option 2 we'd also be reading old and new files with different filter formats, right? Do we handle an exception and detect dynamic vs normal bloom filters?
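The "use more bits as more entries come in" behavior can be illustrated with a toy chained filter that opens a fresh fixed-size slice once the current one reaches its target entry count, rather than letting a single bit set saturate. This is an illustrative stand-in with made-up parameters, not Hadoop's actual org.apache.hadoop.util.bloom.DynamicBloomFilter:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy "growing" filter: opens a fresh fixed-size slice whenever the current
// slice reaches its target entry count. Illustrative stand-in only, NOT
// Hadoop's DynamicBloomFilter.
class ToyDynamicBloom {
    static final int BITS_PER_SLICE = 1 << 16;   // bits per slice
    static final int ENTRIES_PER_SLICE = 5_000;  // target entries per slice
    static final int NUM_HASHES = 7;

    private final List<BitSet> slices = new ArrayList<>();
    private int entriesInCurrent = 0;

    ToyDynamicBloom() {
        slices.add(new BitSet(BITS_PER_SLICE));
    }

    // Double hashing: probe i uses h1 + i * h2.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 * 0x9E3779B9) | 1; // force odd so probes differ per i
        return Math.floorMod(h1 + i * h2, BITS_PER_SLICE);
    }

    void add(String key) {
        if (entriesInCurrent >= ENTRIES_PER_SLICE) {
            slices.add(new BitSet(BITS_PER_SLICE)); // grow: start a new slice
            entriesInCurrent = 0;
        }
        BitSet current = slices.get(slices.size() - 1);
        for (int i = 0; i < NUM_HASHES; i++) {
            current.set(index(key, i));
        }
        entriesInCurrent++;
    }

    // A key "might" be present if ALL probes hit in ANY slice.
    boolean mightContain(String key) {
        for (BitSet slice : slices) {
            boolean hit = true;
            for (int i = 0; i < NUM_HASHES && hit; i++) {
                hit = slice.get(index(key, i));
            }
            if (hit) {
                return true;
            }
        }
        return false;
    }

    int sliceCount() {
        return slices.size();
    }

    public static void main(String[] args) {
        ToyDynamicBloom filter = new ToyDynamicBloom();
        for (int i = 0; i < 10_000; i++) {
            filter.add("key-" + i);
        }
        // 10,000 entries at 5,000 per slice -> 2 slices; no false negatives.
        System.out.println("slices=" + filter.sliceCount()
            + " contains(key-42)=" + filter.mightContain("key-42"));
    }
}
```

The cost, as discussed, is that each membership test must probe every slice, so lookup time and the false positive rate grow with the number of slices.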

@vinothchandar (Member) commented:

Have you serialized both filters and checked whether you can read a serialized DynamicBloomFilter as a BloomFilter? (Wishful thinking, but it could still simplify things a lot if true.)

@n3nash n3nash changed the title Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing (WIP) Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing May 10, 2019
@vinothchandar vinothchandar added the status:in-progress Work in progress label May 16, 2019
@vinothchandar vinothchandar changed the title (WIP) Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing May 16, 2019
@vinothchandar (Member) commented:

Closing in favor of #976
