This project implements a Bloom Filter data structure using Murmur3 hash functions.
├── README.md
├── bloom_filter.py
└── main.py
- Python 3.x
- mmh3
- bitarray
- Clone the repository:
git clone <https://github.com/abhishekpatel946/bloom-filter.git>
- Install dependencies:
pip install mmh3 bitarray
- Modify main.py to customize the Bloom Filter parameters and test cases.
- Run main.py to see the Bloom Filter in action.
python main.py
- Result
Size of bit array: 124
False positive probability: 0.05
Number of hash functions: 4
word_present
['bolster', 'abound', 'colorful', 'comfort', 'abounds', 'generous', 'bonus', 'coherent', 'accessible', 'bonuses', 'cohesive', 'bonny', 'blossom', 'generosity', 'comely', 'genial', 'gems', 'abundance', 'bloom', 'abundant', 'generously']
word_absent
['bluff', 'hate', 'gloomy', 'geeksforgeeks', 'facebook', 'cheater', 'war', 'nuke', 'racism', 'twitter', 'hurt', 'humanity']
test_words
['racism', 'twitter', 'gloomy', 'coherent', 'facebook', 'hate', 'nuke', 'bonus', 'abounds', 'abound', 'humanity', 'bonuses', 'bolster', 'comfort', 'cheater', 'accessible', 'hurt', 'geeksforgeeks', 'bluff', 'colorful', 'war', 'generous']
racism is definitely not present...!❌
twitter is false positive!!! ⁉️
gloomy is definitely not present...!❌
...
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. For example, checking availability of username is set membership problem, where the set is the list of all registered username. The price we pay for efficiency is that it is probabilistic in nature that means, there might be some False Positive results. False positive means, it might tell that given username is already taken but actually it’s not. Interesting Properties of Bloom Filters
- Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.
- Adding an element never fails. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
- Bloom filters never generate false negative result, i.e., telling you that a username doesn’t exist when it actually exists.
- Deleting elements from filter is not possible because, if we delete a single element by clearing bits at indices generated by k hash functions, it might cause deletion of few other elements. Example – if we delete “geeks” (in given example below) by clearing bit at 1, 4 and 7, we might end up deleting “nerd” also Because bit at index 4 becomes 0 and bloom filter claims that “nerd” is not present.
- insert(x) : To insert an element in the Bloom Filter.
- lookup(x) : to check whether an element is already present in Bloom Filter with a positive false probability.
NOTE : We cannot delete an element in Bloom Filter. Probability of False positivity: Let m be the size of bit array, k be the number of hash functions and n be the number of expected elements to be inserted in the filter, then the probability of false positive p can be calculated as: P = (1−[1−m/1]^kn)^k
If expected number of elements n is known and desired false positive probability is p then the size of bit array m can be calculated as: m = -(n log P)/(log2)^2
The number of hash functions k must be a positive integer. If m is size of bit array and n is number of elements to be inserted, then k can be calculated as: k = ((n/m)log2)
If we want to store large list of items in a set for purpose of set membership, we can store it in hashmap, tries or simple array or linked list. All these methods require storing item itself, which is not very memory efficient. For example, if we want to store “geeks” in hashmap we have to store actual string “ geeks” as a key value pair {some_key : ”geeks”}. Bloom filters do not store the data item at all. As we have seen they use bit array which allow hash collision. Without hash collision, it would not be compact.
The hash function used in bloom filters should be independent and uniformly distributed. They should be fast as possible. Fast simple non cryptographic hashes which are independent enough include murmur, FNV series of hash functions and Jenkins hashes. Generating hash is major operation in bloom filters. Cryptographic hash functions provide stability and guarantee but are expensive in calculation. With increase in number of hash functions k, bloom filter become slow. All though non-cryptographic hash functions do not provide guarantee but provide major performance improvement.
Bloom Filters – Introduction and Implementation
Memcache - TutorialCachingStory
Dynamo: Amazon’s Highly Available Key-value Store
Firestore: The NoSQL Serverless Database for the Application Developer