
Conversation

@n3nash (Contributor) commented May 6, 2019

Performed the following tests to check for correctness & performance:

Correctness
  numEntries: 1000, ErrorRate: 0.0000001, NumKeysAdded: 10000 & 100000 (10x and 100x), Iterations: 100
  BloomIndex: 10+ false positives on average (best case = 10, worst case = 1000)
  DynamicBloomIndex: 0 false positives in best, average & worst case

Speed
  numEntries: 500000, ErrorRate: 0.0000001, NumKeysAdded: 500000, Iterations: 100
  BloomIndex: 0.3 secs in average, best & worst case
  DynamicBloomIndex: best case = 0.3 secs, worst case = 0.7 secs

Size
  numEntries: 500000, ErrorRate: 0.0000001, NumKeysAdded: 500000, Iterations: 100
  BloomIndex: 2 MB in average, best & worst case
  DynamicBloomIndex: 2 MB in average, best & worst case
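For reference, the ~2 MB size figure is consistent with the standard Bloom filter sizing formula (bits m = -n ln p / (ln 2)^2). A quick stdlib-only check, using the numEntries/ErrorRate from the size test above; this is generic Bloom math, not Hudi code:

```java
// Standard Bloom filter sizing: bits m = ceil(-n ln p / (ln 2)^2),
// hash functions k = round((m / n) ln 2). Generic math, not Hudi code.
class BloomSizing {
    static long bits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    static int hashes(long n, double p) {
        return (int) Math.round((double) bits(n, p) / n * Math.log(2));
    }

    public static void main(String[] args) {
        long n = 500_000;        // numEntries from the size test above
        double p = 0.0000001;    // ErrorRate from the size test above
        long m = bits(n, p);
        // ~16.8M bits, i.e. roughly 2.1 MB, matching the ~2 MB measured above.
        System.out.printf("bits=%d bytes=%d hashes=%d%n", m, m / 8, hashes(n, p));
    }
}
```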

@n3nash n3nash force-pushed the dynamic_bloom_filter branch 4 times, most recently from e1e0a61 to cad3b2f Compare May 6, 2019 04:34
…m filter for static sizing

- Expose Index Config to make dynamic bloom filter configurable & keep backwards compatibility
@n3nash n3nash force-pushed the dynamic_bloom_filter branch from cad3b2f to 760f513 Compare May 6, 2019 04:55
@n3nash n3nash requested a review from bvaradar May 6, 2019 05:49
@vinothchandar (Member) left a comment

Results seem great. A few questions:

  • On the tests where DynamicBloomFilter achieves a better fpp/error rate, did you notice an increase in size on disk?
  • Also, what kind of difference did you notice in reading time for a 2MB filter vs a 200KB filter? (I am thinking in parallel about bumping the default bloom filter entries up to 500K.)

public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL =
"hoodie.bloom.index.input.storage" + ".level";
public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
public static final String BLOOM_INDEX_ENABLE_DYNAMIC_PROP =
Member:

Rename to "hoodie.bloom.index.dynamic.bloomfilter" or "hoodie.bloom.index.auto.tune.bloomfilter"? To make it clearer it's about the bloom filter and not the index checking.

Also, this should probably belong in the storage config, since it's about how we write the parquet file, right?

Contributor Author:

I suggest we leave it in the IndexConfig, since logically that's where people would look for all Index/BloomFilter related settings?

Member:

Depends on how you look at it. At the code level, it's weird to suddenly access an index config in storage. We can leave it here for now, but let's rename?

    HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable) throws IOException {
    BloomFilter filter = new BloomFilter(config.getBloomFilterNumEntries(),
-       config.getBloomFilterFPP());
+       config.getBloomFilterFPP(), false);
Member:

Shouldn't we pass the config here as well?

Contributor Author:

good catch

Member:

@n3nash without the config here, actually we would not have written dynamic filters at all during the tests?

Contributor Author:

I ran tests isolated to just the bloom filter and generated UUIDs + random strings (to simulate non-UUID-based keys); the tests don't go through this code path.


public BloomFilter(int numEntries, double errorRate) {
this(numEntries, errorRate, Hash.MURMUR_HASH);
private org.apache.hadoop.util.bloom.DynamicBloomFilter dynamicBloomFilter = null;
Member:

Would be curious to see which other projects use this, if it's easy to find out.

Contributor Author:

HBase -> https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/BloomFilter.html
Looked through the HBase code; they have implemented their own bloom filters based on the algorithms in the above class.
Cassandra -> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsTuningBloomFilters.html
They don't support dynamic bloom filters; instead they force rewriting the bloom filter.


String footerVal = readParquetFooter(configuration, parquetFilePath,
HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY).get(0);
return new BloomFilter(footerVal);
return new BloomFilter(footerVal, enableDynamicBloomFilter);
Member:

I am assuming this is needed because only then would we know the type of bloom filter and deserialize it properly? Can BloomFilter read a serialized DynamicBloomFilter value? What happens when we have a mix of files written using normal and dynamic filters? Should we resolve this using an additional footer, instead of making this a writer-side config?

Contributor Author:

Have some thoughts around this, will discuss f2f
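The "additional footer" idea raised above could look like the following sketch: prefix the serialized filter with a one-byte type code and dispatch on it at read time, so the reader does not depend on a write-time config. All names here are illustrative, not Hudi's actual API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: tag the serialized filter bytes with a type code so a
// reader can tell normal vs dynamic filters apart per file. Illustrative
// names only, not Hudi's actual API.
class TaggedFilterCodec {
    static final byte TYPE_SIMPLE = 0;
    static final byte TYPE_DYNAMIC = 1;

    static byte[] serialize(byte type, byte[] filterBytes) {
        byte[] out = new byte[filterBytes.length + 1];
        out[0] = type; // one-byte "footer" tag up front
        System.arraycopy(filterBytes, 0, out, 1, filterBytes.length);
        return out;
    }

    static byte type(byte[] serialized) {
        return serialized[0];
    }

    static byte[] payload(byte[] serialized) {
        return Arrays.copyOfRange(serialized, 1, serialized.length);
    }

    public static void main(String[] args) {
        byte[] fake = "fake-filter-bits".getBytes(StandardCharsets.UTF_8);
        byte[] stored = serialize(TYPE_DYNAMIC, fake);
        // Reader dispatches on the tag instead of trusting a writer config.
        System.out.println("type=" + type(stored)
            + " payloadLen=" + payload(stored).length);
    }
}
```

A mix of old and new files is then handled per file: untagged or TYPE_SIMPLE payloads go through the existing deserializer, TYPE_DYNAMIC through the dynamic one.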

import java.io.IOException;
import java.util.UUID;

public abstract class AbstractBloomFilter {
Member:

rename : AbstractBloomFilterTest

Contributor Author:

done

@vinothchandar vinothchandar self-assigned this May 6, 2019
@n3nash (Contributor Author) commented May 7, 2019

I did not see any significant increase in size on disk. If you see my comment above on performance results, there is one on size and both are close to 2 MB; I rounded them to the nearest megabyte, so there may be differences in the kilobytes.
I did not capture reading times (disk -> memory); I'll add some results for that too. But I was also thinking of bumping the default entries higher, to 500K-ish.
General comment:
Option 1: Should we just use DynamicBloom by default? What we can do is add a bit in the footer to say whether the new bloom filter is dynamic or not. When we deserialize, if we find this bit, we treat it as dynamic, else as a regular bloom. Whenever a file is rewritten (through merge handle), convert the regular bloom to dynamic.
Option 2: Start with a large default for the bloom filter (assuming reads are not affected much). If the file is ever rewritten, re-evaluate the number of entries in the bloom filter based on the number of entries in the parquet file, and reset it.

@vinothchandar (Member) commented May 7, 2019

> there is one on size and both of them are close to 2MB, I actually rounded them off to the near megabyte, there may be differences in kilobytes.

Can we test with N=500000, fp=0.000000001, and 10x/100x that? I think that will produce larger sizes/more fps. I would be surprised if dynamic provides many fewer fps with the same number of bits; all it must be doing is using more bits as more entries come in. You can use something like https://krisives.github.io/bloom-calculator/ to design a case around this.

If proven to work, yes, we should enable DynamicBloom by default. I think we have to do option 1, right? In option 2 we'd also be reading old and new files with different filter formats, right? Do we handle an exception and detect dynamic vs normal bloom filters?
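The "use more bits as more entries come in" behavior can be illustrated with a toy chained filter that opens a fresh fixed-size slice once the current one reaches its target entry count, rather than letting a single bit set saturate. This is an illustrative stand-in with made-up parameters, not Hadoop's actual org.apache.hadoop.util.bloom.DynamicBloomFilter:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy "growing" filter: opens a fresh fixed-size slice whenever the current
// slice reaches its target entry count. Illustrative stand-in only, NOT
// Hadoop's DynamicBloomFilter.
class ToyDynamicBloom {
    static final int BITS_PER_SLICE = 1 << 16;   // bits per slice
    static final int ENTRIES_PER_SLICE = 5_000;  // target entries per slice
    static final int NUM_HASHES = 7;

    private final List<BitSet> slices = new ArrayList<>();
    private int entriesInCurrent = 0;

    ToyDynamicBloom() {
        slices.add(new BitSet(BITS_PER_SLICE));
    }

    // Double hashing: probe i uses h1 + i * h2.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 * 0x9E3779B9) | 1; // force odd so probes differ per i
        return Math.floorMod(h1 + i * h2, BITS_PER_SLICE);
    }

    void add(String key) {
        if (entriesInCurrent >= ENTRIES_PER_SLICE) {
            slices.add(new BitSet(BITS_PER_SLICE)); // grow: start a new slice
            entriesInCurrent = 0;
        }
        BitSet current = slices.get(slices.size() - 1);
        for (int i = 0; i < NUM_HASHES; i++) {
            current.set(index(key, i));
        }
        entriesInCurrent++;
    }

    // A key "might" be present if ALL probes hit in ANY slice.
    boolean mightContain(String key) {
        for (BitSet slice : slices) {
            boolean hit = true;
            for (int i = 0; i < NUM_HASHES && hit; i++) {
                hit = slice.get(index(key, i));
            }
            if (hit) {
                return true;
            }
        }
        return false;
    }

    int sliceCount() {
        return slices.size();
    }

    public static void main(String[] args) {
        ToyDynamicBloom filter = new ToyDynamicBloom();
        for (int i = 0; i < 10_000; i++) {
            filter.add("key-" + i);
        }
        // 10,000 entries at 5,000 per slice -> 2 slices; no false negatives.
        System.out.println("slices=" + filter.sliceCount()
            + " contains(key-42)=" + filter.mightContain("key-42"));
    }
}
```

The cost, as discussed, is that each membership test must probe every slice, so lookup time and the false positive rate grow with the number of slices.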

@vinothchandar (Member) commented:

Have you serialized both filters and checked whether you can read a serialized DynamicBloomFilter as a BloomFilter? (Wishful thinking, but it could still simplify things a lot if true.)

@n3nash n3nash changed the title Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing (WIP) Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing May 10, 2019
@vinothchandar vinothchandar added the status:in-progress Work in progress label May 16, 2019
@vinothchandar vinothchandar changed the title (WIP) Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing May 16, 2019
@vinothchandar (Member) commented:

Closing in favor of #976
