Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing #666
Conversation
Force-pushed from e1e0a61 to cad3b2f:
…m filter for static sizing - Expose Index Config to make dynamic bloom filter configurable & keep backwards compatibility
Force-pushed from cad3b2f to 760f513
vinothchandar left a comment:
Results seem great.. A few questions:
- On tests where DynamicBloomFilter achieves a better fpp/error rate, did you notice an increase in size on disk?
- Also, on reading a 2MB filter vs a 200KB filter: what kind of difference did you notice in reading time for bloom filters? (In parallel, I am thinking about bumping the default bloom filter entries up to 500K.)
  public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL =
      "hoodie.bloom.index.input.storage" + ".level";
  public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
+ public static final String BLOOM_INDEX_ENABLE_DYNAMIC_PROP =
rename to "hoodie.bloom.index.dynamic.bloomfilter" or "hoodie.bloom.index.auto.tune.bloomfilter" ? to make it clearer it's about the bloom filter and not the index checking..
Also this should probably belong in storage config? since its about how we write the parquet file right
I suggest we leave it in the IndexConfig, since logically that's where people would look for all Index/BloomFilter related settings?
Depends on how you look at it.. At the code level, it's weird to suddenly access an index config in storage.. We can leave it here for now, but let's rename?
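For illustration only, a minimal sketch of what the rename could look like in the index config. The constant names, default value, getter name, and the `props` field are assumptions based on the suggestion above, not the PR's actual code:

```java
// Hypothetical rename following the review suggestion; default value, getter name,
// and the backing 'props' Properties field are illustrative assumptions.
public static final String BLOOM_INDEX_DYNAMIC_BLOOM_FILTER_PROP =
    "hoodie.bloom.index.dynamic.bloomfilter";
public static final String DEFAULT_BLOOM_INDEX_DYNAMIC_BLOOM_FILTER = "false";

public boolean useDynamicBloomFilter() {
  return Boolean.parseBoolean(props.getProperty(
      BLOOM_INDEX_DYNAMIC_BLOOM_FILTER_PROP, DEFAULT_BLOOM_INDEX_DYNAMIC_BLOOM_FILTER));
}
```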
      HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable) throws IOException {
    BloomFilter filter = new BloomFilter(config.getBloomFilterNumEntries(),
-       config.getBloomFilterFPP());
+       config.getBloomFilterFPP(), false);
Shouldn't we pass the config here as well?
good catch
@n3nash without the config here, we actually would not have written dynamic filters at all during the tests?
I ran tests isolated to just the bloom filter and generated UUIDs + random strings (to simulate non-UUID-based keys); the tests don't go through this code path.
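To make the suggestion concrete, a sketch of threading the config through in the write path instead of hard-coding `false`. The `useDynamicBloomFilter()` getter is the hypothetical one sketched earlier, not an existing method:

```java
// Hypothetical: pass the writer config's flag instead of a hard-coded 'false',
// so dynamic filters actually get written when the property is enabled.
BloomFilter filter = new BloomFilter(config.getBloomFilterNumEntries(),
    config.getBloomFilterFPP(), config.useDynamicBloomFilter());
```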
  public BloomFilter(int numEntries, double errorRate) {
    this(numEntries, errorRate, Hash.MURMUR_HASH);
+ private org.apache.hadoop.util.bloom.DynamicBloomFilter dynamicBloomFilter = null;
Would be curious to see which other projects use this.. if it's easy to find out..
HBase -> https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/BloomFilter.html
Looked through the HBase code; they have implemented their own BloomFilters based on the algorithms in the above class.
Cassandra -> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsTuningBloomFilters.html
They don't support dynamic bloom filters; instead they force rewriting the bloom filter.
Accumulo seems to be implementing something as well:
https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/bloomfilter/DynamicBloomFilter.java
https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/bloomfilter/DynamicBloomFilter.java also has one.. It'd be good to know the tradeoffs each made, esp. HBase and Accumulo.
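For context on the Hadoop class the new field above refers to, a minimal usage sketch of org.apache.hadoop.util.bloom.DynamicBloomFilter; the class and sizing numbers are arbitrary, for illustration only:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.util.bloom.DynamicBloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class DynamicBloomFilterDemo {
  public static void main(String[] args) {
    // Arbitrary sizing for illustration: each internal filter gets 10k bits and 5 hash
    // functions; a new internal filter is appended after every 1000 keys are added.
    DynamicBloomFilter dbf = new DynamicBloomFilter(10_000, 5, Hash.MURMUR_HASH, 1_000);
    dbf.add(new Key("some-record-key".getBytes(StandardCharsets.UTF_8)));
    // Membership tests check every internal filter that has been appended so far.
    System.out.println(dbf.membershipTest(
        new Key("some-record-key".getBytes(StandardCharsets.UTF_8))));
  }
}
```

Hadoop's implementation appends a new internal filter once the active one has recorded its per-row key budget, which is what keeps the filter useful when the actual key count exceeds the static estimate.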
    String footerVal = readParquetFooter(configuration, parquetFilePath,
        HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY).get(0);
-   return new BloomFilter(footerVal);
+   return new BloomFilter(footerVal, enableDynamicBloomFilter);
I am assuming this is needed because only then would we know the type of bloom filter and deserialize it properly? Can BloomFilter read a serialized DynamicBloomFilter value? What happens when we have a mix of files written using normal and dynamic filters? Should we resolve this using an additional footer, instead of making this a writer-side config?
Have some thoughts around this, will discuss f2f
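Not the PR's implementation, just a sketch of the "additional footer" idea for the read path. `readParquetFooter`, `HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY`, and the two-arg BloomFilter constructor come from the excerpt above; the extra footer key name is made up for illustration:

```java
// Sketch: resolve the filter type from an extra parquet footer entry written next to the
// serialized filter, instead of relying on a writer-side config at read time.
String footerVal = readParquetFooter(configuration, parquetFilePath,
    HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY).get(0);
// "hoodie.avro.bloom.filter.type" is a hypothetical footer key; files written before this
// change would need a fallback when the entry is absent (elided here).
String filterType = readParquetFooter(configuration, parquetFilePath,
    "hoodie.avro.bloom.filter.type").get(0);
// Anything other than an explicit "DYNAMIC" marker falls back to the simple filter.
return new BloomFilter(footerVal, "DYNAMIC".equals(filterType));
```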
import java.io.IOException;
import java.util.UUID;

public abstract class AbstractBloomFilter {
Rename to AbstractBloomFilterTest
done
I did not see any significant increase in size on disk. If you look at my comment above on performance results, there is one on size: both filters are close to 2MB. I rounded them off to the nearest megabyte, so there may be differences in the kilobyte range.
Can we test with N= ? If proven to work, yes, we should enable DynamicBloom by default. I think we have to do option 1, right? With option 2 we'd also be reading old and new files with different filter formats, right? Do we handle an exception and detect dynamic vs normal bloom filters?
Have you serialized both filters and checked whether you can read a serialized DynamicBloomFilter as a BloomFilter? (Wishful thinking, but it could still simplify things a lot if true.)
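If it helps, a minimal, self-contained way to check that against Hadoop's bloom filter Writables directly (Hudi's own wrapper class is not used here; sizing numbers are arbitrary):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.DynamicBloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class FilterSerdeCheck {
  public static void main(String[] args) throws IOException {
    // Arbitrary sizing: 10k bits, 5 hashes, grow after every 1000 keys.
    DynamicBloomFilter dynamic = new DynamicBloomFilter(10_000, 5, Hash.MURMUR_HASH, 1_000);
    dynamic.add(new Key("record-key-1".getBytes(StandardCharsets.UTF_8)));

    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    dynamic.write(new DataOutputStream(bytes));

    BloomFilter plain = new BloomFilter();
    try {
      // DynamicBloomFilter serializes extra state (row count, per-row filters) after the
      // common header, so reading it back as a plain BloomFilter is expected to either
      // fail or produce a filter with inconsistent bits.
      plain.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      System.out.println("Readable as BloomFilter; membership check: "
          + plain.membershipTest(new Key("record-key-1".getBytes(StandardCharsets.UTF_8))));
    } catch (IOException | RuntimeException e) {
      System.out.println("Not cross-readable: " + e);
    }
  }
}
```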
Closing in favor of #976
Performed the following tests to check for correctness & performance: