Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md
bloom_filter.py		bloom_filter.py
main.py		main.py

README.md

Bloom Filter

Overview

This project implements a Bloom Filter data structure using Murmur3 hash functions.

Directory Structure

├── README.md
├── bloom_filter.py
└── main.py

Dependencies

Python 3.x
mmh3
bitarray

Installation

Clone the repository:

git clone <https://github.com/abhishekpatel946/bloom-filter.git>

Install dependencies:

pip install mmh3 bitarray

Usage

Modify main.py to customize the Bloom Filter parameters and test cases.
Run main.py to see the Bloom Filter in action.

python main.py

Result

Size of bit array: 124
False positive probability: 0.05
Number of hash functions: 4

word_present 
 ['bolster', 'abound', 'colorful', 'comfort', 'abounds', 'generous', 'bonus', 'coherent', 'accessible', 'bonuses', 'cohesive', 'bonny', 'blossom', 'generosity', 'comely', 'genial', 'gems', 'abundance', 'bloom', 'abundant', 'generously'] 

word_absent 
 ['bluff', 'hate', 'gloomy', 'geeksforgeeks', 'facebook', 'cheater', 'war', 'nuke', 'racism', 'twitter', 'hurt', 'humanity'] 

test_words 
 ['racism', 'twitter', 'gloomy', 'coherent', 'facebook', 'hate', 'nuke', 'bonus', 'abounds', 'abound', 'humanity', 'bonuses', 'bolster', 'comfort', 'cheater', 'accessible', 'hurt', 'geeksforgeeks', 'bluff', 'colorful', 'war', 'generous'] 

racism is definitely not present...!❌
twitter is false positive!!! ⁉️
gloomy is definitely not present...!❌
...

What is the Bloom Filter?

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. For example, checking availability of username is set membership problem, where the set is the list of all registered username. The price we pay for efficiency is that it is probabilistic in nature that means, there might be some False Positive results. False positive means, it might tell that given username is already taken but actually it’s not. Interesting Properties of Bloom Filters

Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.
Adding an element never fails. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
Bloom filters never generate false negative result, i.e., telling you that a username doesn’t exist when it actually exists.
Deleting elements from filter is not possible because, if we delete a single element by clearing bits at indices generated by k hash functions, it might cause deletion of few other elements. Example – if we delete “geeks” (in given example below) by clearing bit at 1, 4 and 7, we might end up deleting “nerd” also Because bit at index 4 becomes 0 and bloom filter claims that “nerd” is not present.

Operations that a Bloom Filter supports

insert(x) : To insert an element in the Bloom Filter.
lookup(x) : to check whether an element is already present in Bloom Filter with a positive false probability.

NOTE : We cannot delete an element in Bloom Filter. Probability of False positivity: Let m be the size of bit array, k be the number of hash functions and n be the number of expected elements to be inserted in the filter, then the probability of false positive p can be calculated as: P = (1−[1−m/1]^kn)^k

Size of Bit Array

If expected number of elements n is known and desired false positive probability is p then the size of bit array m can be calculated as: m = -(n log P)/(log2)^2

Optimum number of hash functions

The number of hash functions k must be a positive integer. If m is size of bit array and n is number of elements to be inserted, then k can be calculated as: k = ((n/m)log2)

Space Efficient

If we want to store large list of items in a set for purpose of set membership, we can store it in hashmap, tries or simple array or linked list. All these methods require storing item itself, which is not very memory efficient. For example, if we want to store “geeks” in hashmap we have to store actual string “ geeks” as a key value pair {some_key : ”geeks”}. Bloom filters do not store the data item at all. As we have seen they use bit array which allow hash collision. Without hash collision, it would not be compact.

Choice of Hash Function

The hash function used in bloom filters should be independent and uniformly distributed. They should be fast as possible. Fast simple non cryptographic hashes which are independent enough include murmur, FNV series of hash functions and Jenkins hashes. Generating hash is major operation in bloom filters. Cryptographic hash functions provide stability and guarantee but are expensive in calculation. With increase in number of hash functions k, bloom filter become slow. All though non-cryptographic hash functions do not provide guarantee but provide major performance improvement.

References

Bloom Filters – Introduction and Implementation

Apache Cassandra

Memcache - TutorialCachingStory

Dynamo: Amazon’s Highly Available Key-value Store

Firestore: The NoSQL Serverless Database for the Application Developer

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python

Python

README.md

Bloom Filter

Overview

Directory Structure

Dependencies

Installation

Usage

What is the Bloom Filter?

Operations that a Bloom Filter supports

Size of Bit Array

Optimum number of hash functions

Space Efficient

Choice of Hash Function

References

Files

Python

Directory actions

More options

Directory actions

More options

Latest commit

History

Python

Folders and files

parent directory

README.md

Bloom Filter

Overview

Directory Structure

Dependencies

Installation

Usage

What is the Bloom Filter?

Operations that a Bloom Filter supports

Size of Bit Array

Optimum number of hash functions

Space Efficient

Choice of Hash Function

References