BitSet-backed Bloom filter #44

Merged
merged 10 commits into from
Dec 17, 2021
1 change: 1 addition & 0 deletions package.json
@@ -47,6 +47,7 @@
"typescript": "^3.7.5"
},
"dependencies": {
+"base64-arraybuffer": "^1.0.1",
@lovasoa lovasoa Dec 10, 2021

Could we let users encode the ArrayBuffers only if they need to?

I understand that dealing with strings may be easier in some cases, but today the web platform allows dealing with raw binary data in most contexts (network communication, file access, web worker messages, ...). Since the stated goal of this new feature is performance and storage size, maybe it would be wiser to let the user access the raw data (and encode it if they need to)?

Contributor Author

As a long-term goal, I think a binary format is a great idea. However, there is not currently a drop-in solution for converting JavaScript objects into binary. BSON does not support raw binary data, and Protobuf requires a fair amount of schema setup. This gives us a big win quickly, and is not an unreasonable format if we wish to support multiple serialization formats. Other open source formats support both binary & text modes for convenience (e.g. STL 3D model files.)

Base64 is such a common format on the web that MDN has sample code in its glossary. The new dependency on base64-arraybuffer is only 8k of code.
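
To put rough numbers on the storage argument, here is a minimal sketch comparing the character count of a JSON array of 0/1 numbers (the representation this PR replaces) with padded Base64 over a packed byte array. The helper names are illustrative and not part of the library, and the Base64 figure assumes standard padded output (4 characters per 3 bytes).

```typescript
// Hypothetical helpers for illustration only; they are not part of bloom-filters.

/** Characters needed to store `bits` flags as a JSON array such as [0,1,0]. */
function jsonArrayChars (bits: number): number {
  if (bits === 0) return 2 // "[]"
  return 2 * bits + 1 // one digit per flag, bits - 1 commas, two brackets
}

/** Characters needed for padded Base64 over ceil(bits / 8) packed bytes. */
function base64Chars (bits: number): number {
  const bytes = Math.ceil(bits / 8)
  return 4 * Math.ceil(bytes / 3)
}

const filterBits = 1000
console.log(jsonArrayChars(filterBits)) // 2001
console.log(base64Chars(filterBits)) // 168
```

The exported object also carries a `size` field and JSON punctuation, so the real payload is slightly larger than the Base64 string alone.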

I don't want to require that users learn implementation details in order to serialize efficiently. Just ditching the Base64 encoding would force users to search our objects for TypedArrays in order to serialize them compactly, and then figure out how to deserialize them in a memory-efficient manner. They would also have to figure out what sort of TypedArrays we're using, avoid storing an ArrayBuffer-backed TypedArray twice, and deal with endian issues if they find anything other than a Uint8Array.
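
The endianness concern can be seen in a few lines. This is a small sketch using only standard typed-array APIs; the variable names are illustrative. The bytes behind a `Uint32Array` depend on the host platform, while a `DataView` with an explicit byte order (or a plain `Uint8Array`) produces the same bytes everywhere.

```typescript
// Write a 32-bit value through a Uint32Array and look at the underlying bytes.
const words = new Uint32Array([0x01020304])
const platformBytes = new Uint8Array(words.buffer)
// platformBytes is 4,3,2,1 on a little-endian host and 1,2,3,4 on a big-endian one.
console.log(platformBytes[0])

// A DataView with an explicit byte order yields the same bytes on every host.
const view = new DataView(new ArrayBuffer(4))
view.setUint32(0, 0x01020304, false) // false = big-endian
const portableBytes = new Uint8Array(view.buffer)
console.log(portableBytes[0], portableBytes[1], portableBytes[2], portableBytes[3]) // 1 2 3 4
```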

A binary format is great, but it's a bigger conversation.

@lovasoa

I understand what you are saying, but I still feel like it should be done the other way: provide the native implementation now (it's less code), and maybe add the base64 serialization as a usability improvement for those who need it later. Providing the base64 directly blocks the people who need the binary representation, but providing the binary representation still lets the ones who need base64 use it if they need to.

@Callidon what do you think?

Owner

I side with @dleppik on this one. We want users to be able to use the data structures out of the box, without having to handle topics like binary representation. For most basic usages, it's fine.
We could provide another export method/API point to allow users to customize their binary representation, but it's outside the scope of this PR IMO.

Contributor Author

@lovasoa, I'm not sure what you mean by the native implementation. TypedArrays break JSON serialization. JSON.parse(JSON.stringify(new Uint8Array(1))) yields { '0': 0 }, not a Uint8Array. JavaScript has no native binary format, and we need to store the entire Bloom filter, not just a byte array.
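
The round-trip behavior described above can be reproduced with standard JavaScript alone. This is a small illustrative sketch with made-up variable names; it also shows that the caller has to rebuild the typed array by hand after parsing.

```typescript
const original = new Uint8Array([1, 2, 3])
const json = JSON.stringify(original)
console.log(json) // {"0":1,"1":2,"2":3}

// After parsing we get a plain object keyed by index, not a typed array.
const parsed: Record<string, number> = JSON.parse(json)
const isTypedArray = parsed instanceof Uint8Array
console.log(isTypedArray) // false

// Rebuilding the typed array is left to the caller.
const restored = new Uint8Array(Object.keys(parsed).map(key => parsed[key]))
console.log(restored.length) // 3
```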

Contributor Author

When you do a fully binary version, there are several libraries you may want to consider. BSON is a solid choice, but it may require some tweaking with TypedArrays. I've had good luck with Protobuf (Protocol Buffers), although it is definitely not plug-and-play; you need to write a .proto type definition which is geared toward high-volume data transfer. I've never tried MessagePack, but it may be the strongest contender since their sample JavaScript code serializes a Uint8Array.

@lovasoa lovasoa Dec 10, 2021

Or maybe just create the byte array directly? There are only three integers and a byte array to serialize, so it may not be worth integrating yet another library.

Contributor Author

There's also metadata to include; that's often what trips up binary formats. I was assuming we'd want to support all the top-level classes, some of which are highly structured, while leveraging the existing reflection classes. That gets error-prone quickly.

You're right that adding another library could increase the footprint dramatically, especially if the classes are directly responsible for serializing themselves. If you're using a 1 MB library to save 0.5 MB of data transfer, it's a net loss. Even if you're using a 1 MB library to save 2 MB of data transfer, it may be a net loss in terms of latency, since loading the library blocks the data load. (I considered that before including base64-arraybuffer.) Anything significantly less than 12kB is a wash due to ethernet frame size.

The cleanest solution is to have the encoding in its own separate package, so that users don't have to download it if they don't want it. That also means one particular format doesn't preclude any other.

@lovasoa

I don't think we need to include any metadata. We'll have a magic number to identify the format, the few integers we need, then the byte array. We don't need anything more.
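
As a rough sketch of what such a hand-rolled layout might look like (not an existing bloom-filters format): the magic value, the field order, the choice of `size`, `nbHashes` and `seed` as the integers, and the names `encodeFilter` / `decodeFilter` are all assumptions made for illustration. The seed is stored as a float64 here on the assumption that it may exceed 32 bits.

```typescript
// Hypothetical layout, for illustration only:
// bytes 0-3 magic, 4-7 size, 8-11 nbHashes, 12-19 seed (float64), then the raw bit array.
const MAGIC = 0x424c4f4d // arbitrary marker chosen for this sketch
const HEADER_BYTES = 20

interface RawBloomFilter {
  size: number
  nbHashes: number
  seed: number
  bits: Uint8Array
}

function encodeFilter (filter: RawBloomFilter): ArrayBuffer {
  const buffer = new ArrayBuffer(HEADER_BYTES + filter.bits.length)
  const view = new DataView(buffer)
  // Explicit big-endian writes keep the header identical on every platform.
  view.setUint32(0, MAGIC, false)
  view.setUint32(4, filter.size, false)
  view.setUint32(8, filter.nbHashes, false)
  view.setFloat64(12, filter.seed, false)
  // Copy the bit array right after the header.
  new Uint8Array(buffer, HEADER_BYTES).set(filter.bits)
  return buffer
}

function decodeFilter (buffer: ArrayBuffer): RawBloomFilter {
  if (buffer.byteLength < HEADER_BYTES) {
    throw new Error('Truncated filter data')
  }
  const view = new DataView(buffer)
  if (view.getUint32(0, false) !== MAGIC) {
    throw new Error('Unrecognized format')
  }
  return {
    size: view.getUint32(4, false),
    nbHashes: view.getUint32(8, false),
    seed: view.getFloat64(12, false),
    bits: new Uint8Array(buffer.slice(HEADER_BYTES))
  }
}

const encoded = encodeFilter({ size: 24, nbHashes: 3, seed: 0x1234567890, bits: new Uint8Array([1, 2, 255]) })
const decoded = decodeFilter(encoded)
console.log(encoded.byteLength) // 23
console.log(decoded.size, decoded.nbHashes) // 24 3
```

A layout like this only covers a single filter class; the other top-level classes discussed in this thread would each need their own header fields.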

Owner

@lovasoa I will not open a PR once this is merged, but I will create an issue to track your suggestion. That way, if you or someone else wants to tackle it, you are free to do so. But for now, this topic is outside the scope of this PR.

"is-buffer": "^2.0.4",
"lodash": "^4.17.15",
"lodash.eq": "^4.0.0",
1 change: 1 addition & 0 deletions src/api.ts
@@ -27,6 +27,7 @@ SOFTWARE.
export { default as BloomFilter } from './bloom/bloom-filter'
export { default as CountingBloomFilter } from './bloom/counting-bloom-filter'
export { default as PartitionedBloomFilter } from './bloom/partitioned-bloom-filter'
+export { default as BitSet } from './bloom/bit-set'
export { default as CountMinSketch } from './sketch/count-min-sketch'
export { default as HyperLogLog } from './sketch/hyperloglog'
export { default as TopK } from './sketch/topk'
134 changes: 134 additions & 0 deletions src/bloom/bit-set.ts
@@ -0,0 +1,134 @@
/* file : bit-set.ts
MIT License

Copyright (c) 2021 David Leppik

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

import {encode, decode} from "base64-arraybuffer";

const bitsPerWord = 8;

/**
 * A memory-efficient Boolean array. Contains just the minimal operations needed for our Bloom filter implementation.
 *
 * @author David Leppik
 */
export default class BitSet {
  public readonly size: number;

  // Uint32Array may be slightly faster due to memory alignment, but this avoids endianness issues when serializing
  private array: Uint8Array;

  constructor(size: number) {
    this.size = size
    this.array = new Uint8Array(Math.ceil(size / bitsPerWord))
  }

  /** Returns true if the bit at the given index is set. */
  has(index: number): boolean {
    const wordIndex = Math.floor(index / bitsPerWord)
    const mask = 1 << (index % bitsPerWord)
    return (this.array[wordIndex] & mask) !== 0
  }

  add(index: number) {
    const wordIndex = Math.floor(index / bitsPerWord)
    const mask = 1 << (index % bitsPerWord)
    this.array[wordIndex] = this.array[wordIndex] | mask

@lovasoa lovasoa Dec 10, 2021

You can index the array just once

Suggested change
-    this.array[wordIndex] = this.array[wordIndex] | mask
+    this.array[wordIndex] |= mask

  }

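  /** Clears the bit at the given index. */
  remove(index: number) {
    const wordIndex = Math.floor(index / bitsPerWord)
    const mask = 1 << (index % bitsPerWord)
    // AND with the inverted mask clears the bit; XOR would toggle it and set a bit that was already clear
    this.array[wordIndex] &= ~mask
  }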

  /** Returns the index of the highest set bit, or 0 if no bit is set. */
  max(): number {
    for (let i = this.array.length - 1; i >= 0; i--) {
      const bits = this.array[i];
      if (bits) {
        return BitSet.highBit(bits) + (i * bitsPerWord);
      }
    }
    return 0;
  }

  /** Returns the number of set bits. */
  bitCount(): number {
    let result = 0
    for (let i = 0; i < this.array.length; i++) {
      result += BitSet.countBits(this.array[i]) // Assumes we never have bits set beyond the end
    }
    return result
  }

  public equals(other: BitSet): boolean {
    if (other.size !== this.size) {
      return false
    }
    for (let i = 0; i < this.array.length; i++) {
      if (this.array[i] !== other.array[i]) {
        return false
      }
    }
    return true
  }

  public export(): BitSetData {
    return {
      size: this.size,
      base64: encode(this.array)
    }
  }

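  public static import(data: any): BitSet {
    if (typeof data.size !== "number") {
      throw Error("BitSet missing size")
    }
    // Fail early with a clear message instead of letting decode() receive undefined
    if (typeof data.base64 !== "string") {
      throw Error("BitSet missing data")
    }
    const result = new BitSet(data.size)
    const buffer = decode(data.base64)
    result.array = new Uint8Array(buffer)
    return result
  }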

  /** Returns the position of the highest set bit in a single word, or -1 if the word is 0. */
  private static highBit(bits: number): number {
    let result = bitsPerWord - 1;
    let mask = 1 << result;
    while (result >= 0 && ((mask & bits) !== mask)) {
      mask >>>= 1;
      result--;
    }
    return result;
  }

  /** Counts the set bits in a single word. */
  private static countBits(bits: number): number {
    let result = bits & 1;
    while (bits !== 0) {
      bits = bits >>> 1;
      result += (bits & 1)
    }
    return result
  }
}

interface BitSetData {
  size: number
  base64: string
}
38 changes: 16 additions & 22 deletions src/bloom/bloom-filter.ts
@@ -26,9 +26,10 @@ SOFTWARE.

 import ClassicFilter from '../interfaces/classic-filter'
 import BaseFilter from '../base-filter'
+import BitSet from "./bit-set";
 import { AutoExportable, Field, Parameter } from '../exportable'
 import { optimalFilterSize, optimalHashes } from '../formulas'
-import { HashableInput, allocateArray, getDistinctIndices } from '../utils'
+import { HashableInput, getDistinctIndices } from '../utils'

 /**
  * A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970,
@@ -42,16 +43,13 @@ import { HashableInput, allocateArray, getDistinctIndices } from '../utils'
 @AutoExportable<BloomFilter>('BloomFilter', ['_seed'])
 export default class BloomFilter extends BaseFilter implements ClassicFilter<HashableInput> {
   @Field()
-  private _size: number
+  private readonly _size: number

   @Field()
-  private _nbHashes: number
+  private readonly _nbHashes: number

-  @Field()
-  private _filter: Array<number>
-
-  @Field()
-  private _length: number
+  @Field<BitSet>(f => f.export(), d => BitSet.import(d))
+  private readonly _filter: BitSet

   /**
    * Constructor
@@ -65,13 +63,12 @@ export default class BloomFilter extends BaseFilter implements ClassicFilter<Has
     }
     this._size = size
     this._nbHashes = nbHashes
-    this._filter = allocateArray(this._size, 0)
-    this._length = 0
+    this._filter = new BitSet(size)
   }

   /**
    * Create an optimal bloom filter providing the maximum of elements stored and the error rate desired
-   * @param items - The maximum nuber of item to store
+   * @param nbItems - The maximum number of item to store
    * @param errorRate - The error rate desired for a maximum of items inserted
    * @return A new {@link BloomFilter}
    */
@@ -84,7 +81,7 @@ export default class BloomFilter extends BaseFilter implements ClassicFilter<Has
   /**
    * Build a new Bloom Filter from an existing iterable with a fixed error rate
    * @param items - The iterable used to populate the filter
-   * @param errorRate - The error rate, i.e. 'false positive' rate, targetted by the filter
+   * @param errorRate - The error rate, i.e. 'false positive' rate, targeted by the filter
    * @return A new Bloom Filter filled with the iterable's elements
    * @example
    * // create a filter with a false positive rate of 0.1
@@ -110,7 +107,7 @@ export default class BloomFilter extends BaseFilter implements ClassicFilter<Has
    * @return The filter length
    */
   get length (): number {
-    return this._length
+    return this._filter.bitCount()
   }

   /**
@@ -123,10 +120,7 @@ export default class BloomFilter extends BaseFilter implements ClassicFilter<Has
   add (element: HashableInput): void {
     const indexes = getDistinctIndices(element, this._size, this._nbHashes, this.seed)
     for (let i = 0; i < indexes.length; i++) {
-      if (!this._filter[indexes[i]]) {
-        this._length++
-      }
-      this._filter[indexes[i]] = 1
+      this._filter.add(indexes[i]);
     }
   }

@@ -143,7 +137,7 @@ export default class BloomFilter extends BaseFilter implements ClassicFilter<Has
   has (element: HashableInput): boolean {
     const indexes = getDistinctIndices(element, this._size, this._nbHashes, this.seed)
     for (let i = 0; i < indexes.length; i++) {
-      if (!this._filter[indexes[i]]) {
+      if (!this._filter.has(indexes[i])) {
         return false
       }
     }
@@ -158,18 +152,18 @@ export default class BloomFilter extends BaseFilter implements ClassicFilter<Has
    * console.log(filter.rate()); // output: something around 0.1
    */
   rate (): number {
-    return Math.pow(1 - Math.exp(-this._length / this._size), this._nbHashes)
+    return Math.pow(1 - Math.exp(-this.length / this._size), this._nbHashes)
   }

   /**
    * Check if another Bloom Filter is equal to this one
-   * @param filter - The filter to compare to this one
+   * @param other - The filter to compare to this one
    * @return True if they are equal, false otherwise
    */
   equals (other: BloomFilter): boolean {
-    if (this._size !== other._size || this._nbHashes !== other._nbHashes || this._length !== other._length) {
+    if (this._size !== other._size || this._nbHashes !== other._nbHashes) {
       return false
     }
-    return this._filter.every((value, index) => other._filter[index] === value)
+    return this._filter.equals(other._filter)
   }
 }
118 changes: 118 additions & 0 deletions test/bit-set-test.js
@@ -0,0 +1,118 @@
/* file : bit-set-test.js
MIT License

Copyright (c) 2021 David Leppik

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/


require('chai').should()
const { BitSet } = require('../dist/api')

describe('BitSet', () => {
  it('is initially clear', () => {
    const set = new BitSet(50)
    set.size.should.equal(50)
    for (let i = 0; i < set.size; i++) {
      set.has(i).should.equal(false)
    }
  })

  it('reads and clears set values', () => {
    const set = new BitSet(50)
    set.size.should.equal(50)
    for (let i = 0; i < set.size; i++) {
      set.has(i).should.equal(false)
      set.add(i)
      set.has(i).should.equal(true)
    }
    for (let i = 0; i < set.size; i++) {
      set.has(i).should.equal(true)
      set.remove(i)
      set.has(i).should.equal(false)
    }
  })

  describe('#max', () => {
    it('finds the high bit', () => {
      const set = new BitSet(150)
      set.size.should.equal(150)
      for (let i = 0; i < set.size; i++) {
        set.add(i)
        set.max().should.equal(i)
      }
    })
  })

  it('imports what it exports', () => {
    const set = new BitSet(50)
    for (let i = 0; i < set.size; i += 3) { // 3 is relatively prime to 8, so should hit all edge cases
      set.add(i)
    }
    const exported = set.export()
    const imported = BitSet.import(exported)
    imported.size.should.equal(set.size)
    for (let i = 0; i < set.size; i++) { // `let` avoids leaking an implicit global
      const expected = i % 3 === 0
      imported.has(i).should.equal(expected) // check the imported copy, not the original
    }
  })

  describe('#equals', () => {
    it('returns true on identical size and data', () => {
      let a = new BitSet(50)
      let b = new BitSet(50)
      a.equals(b).should.equal(true)
      for (let i = 0; i < a.size; i += 3) { // 3 is relatively prime to 8, so should hit all edge cases
        a.add(i)
        b.add(i)
        a.equals(b).should.equal(true)
      }
    })

    it('returns false on different size', () => {
      new BitSet(50).equals(new BitSet(150)).should.equal(false)
    })

    it('returns false on different data', () => {
      let a = new BitSet(50)
      let b = new BitSet(50)
      a.add(3)
      a.equals(b).should.equal(false)
      a.remove(3)
      a.equals(b).should.equal(true)
      a.add(49)
      a.equals(b).should.equal(false)
    })
  })

  describe('#bitCount', () => {
    it('counts the number of bits', () => {
      let set = new BitSet(50)
      let expectedCount = 0
      set.bitCount().should.equal(expectedCount)
      for (let i = 0; i < set.size; i += 3) {
        set.add(i)
        expectedCount++
        set.bitCount().should.equal(expectedCount)
      }
    })
  })
})