Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 

Bloom Filter

Overview

This project implements a Bloom Filter data structure using Murmur3 hash functions.

Directory Structure

├── README.md
├── bloom_filter.py
└── main.py

Dependencies

  • Python 3.x
  • mmh3
  • bitarray

Installation

  1. Clone the repository:
git clone <https://github.com/abhishekpatel946/bloom-filter.git>
  1. Install dependencies:
pip install mmh3 bitarray

Usage

  1. Modify main.py to customize the Bloom Filter parameters and test cases.
  2. Run main.py to see the Bloom Filter in action.
python main.py
  1. Result
Size of bit array: 124
False positive probability: 0.05
Number of hash functions: 4

word_present 
 ['bolster', 'abound', 'colorful', 'comfort', 'abounds', 'generous', 'bonus', 'coherent', 'accessible', 'bonuses', 'cohesive', 'bonny', 'blossom', 'generosity', 'comely', 'genial', 'gems', 'abundance', 'bloom', 'abundant', 'generously'] 

word_absent 
 ['bluff', 'hate', 'gloomy', 'geeksforgeeks', 'facebook', 'cheater', 'war', 'nuke', 'racism', 'twitter', 'hurt', 'humanity'] 

test_words 
 ['racism', 'twitter', 'gloomy', 'coherent', 'facebook', 'hate', 'nuke', 'bonus', 'abounds', 'abound', 'humanity', 'bonuses', 'bolster', 'comfort', 'cheater', 'accessible', 'hurt', 'geeksforgeeks', 'bluff', 'colorful', 'war', 'generous'] 

racism is definitely not present...!❌
twitter is false positive!!! ⁉️
gloomy is definitely not present...!❌
...

What is the Bloom Filter?

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. For example, checking availability of username is set membership problem, where the set is the list of all registered username. The price we pay for efficiency is that it is probabilistic in nature that means, there might be some False Positive results. False positive means, it might tell that given username is already taken but actually it’s not. Interesting Properties of Bloom Filters

  • Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.
  • Adding an element never fails. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
  • Bloom filters never generate false negative result, i.e., telling you that a username doesn’t exist when it actually exists.
  • Deleting elements from filter is not possible because, if we delete a single element by clearing bits at indices generated by k hash functions, it might cause deletion of few other elements. Example – if we delete “geeks” (in given example below) by clearing bit at 1, 4 and 7, we might end up deleting “nerd” also Because bit at index 4 becomes 0 and bloom filter claims that “nerd” is not present.

Operations that a Bloom Filter supports

  • insert(x) : To insert an element in the Bloom Filter.
  • lookup(x) : to check whether an element is already present in Bloom Filter with a positive false probability.

NOTE : We cannot delete an element in Bloom Filter. Probability of False positivity: Let m be the size of bit array, k be the number of hash functions and n be the number of expected elements to be inserted in the filter, then the probability of false positive p can be calculated as: P = (1−[1−m/1]^kn)^k

Size of Bit Array

If expected number of elements n is known and desired false positive probability is p then the size of bit array m can be calculated as: m = -(n log P)/(log2)^2​

Optimum number of hash functions

The number of hash functions k must be a positive integer. If m is size of bit array and n is number of elements to be inserted, then k can be calculated as: k = ((n/m)log2)

Space Efficient

If we want to store large list of items in a set for purpose of set membership, we can store it in hashmap, tries or simple array or linked list. All these methods require storing item itself, which is not very memory efficient. For example, if we want to store “geeks” in hashmap we have to store actual string “ geeks” as a key value pair {some_key : ”geeks”}. Bloom filters do not store the data item at all. As we have seen they use bit array which allow hash collision. Without hash collision, it would not be compact.

Choice of Hash Function

The hash function used in bloom filters should be independent and uniformly distributed. They should be fast as possible. Fast simple non cryptographic hashes which are independent enough include murmur, FNV series of hash functions and Jenkins hashes. Generating hash is major operation in bloom filters. Cryptographic hash functions provide stability and guarantee but are expensive in calculation. With increase in number of hash functions k, bloom filter become slow. All though non-cryptographic hash functions do not provide guarantee but provide major performance improvement.

References

Bloom Filters – Introduction and Implementation

Apache Cassandra

Memcache - TutorialCachingStory

Dynamo: Amazon’s Highly Available Key-value Store

Firestore: The NoSQL Serverless Database for the Application Developer