-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BitSet-backed Bloom filter #44
Merged
Merged
Changes from all commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
7dfa3c1
Add BitSet with tests.
dleppik 2c70611
Bloom filter uses BitSet. Adds dependency on base64-arraybuffer for s…
dleppik a0b7904
BitSetData.base64 renamed content. More BitSet import/export tests.
dleppik 3832f3a
Fixed #saveAsJSON tests.
dleppik cf1a0bb
Added optional seed parameter to BoomFilter.from(), since it is used …
dleppik 29659ec
Fixed invalid argument in filter.has.
dleppik 0aa8c47
Style fixes: added documentation and terminating newline.
dleppik 2ea5f42
Update src/bloom/bit-set.ts
dleppik cffb44a
Style fixes: expanded doc comments.
dleppik 95ce810
Removed buggy, unused BitSet.prototype.remove.
dleppik File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,168 @@ | ||||||
/* file : BitSet.ts | ||||||
MIT License | ||||||
|
||||||
Copyright (c) 2021 David Leppik | ||||||
folkvir marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||||||
of this software and associated documentation files (the "Software"), to deal | ||||||
in the Software without restriction, including without limitation the rights | ||||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||||||
copies of the Software, and to permit persons to whom the Software is | ||||||
furnished to do so, subject to the following conditions: | ||||||
|
||||||
The above copyright notice and this permission notice shall be included in all | ||||||
copies or substantial portions of the Software. | ||||||
|
||||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||||||
SOFTWARE. | ||||||
*/ | ||||||
|
||||||
import {encode, decode} from "base64-arraybuffer"; | ||||||
|
||||||
const bitsPerWord = 8; | ||||||
|
||||||
/** | ||||||
* A memory-efficient Boolean array. Contains just the minimal operations needed for our Bloom filter implementation. | ||||||
* | ||||||
* @author David Leppik | ||||||
*/ | ||||||
export default class BitSet { | ||||||
readonly size: number; | ||||||
|
||||||
// Uint32Array may be slightly faster due to memory alignment, but this avoids endianness when serializing | ||||||
private array: Uint8Array; | ||||||
folkvir marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
/** | ||||||
* Constructor. All bits are initially set to false. | ||||||
* @param size the number of bits that can be stored. (This is NOT required to be a multiple of 8.) | ||||||
*/ | ||||||
constructor(size: number) { | ||||||
this.size = size | ||||||
this.array = new Uint8Array(Math.ceil(size / bitsPerWord)) | ||||||
folkvir marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
} | ||||||
|
||||||
/** | ||||||
* Returns the value of the bit at the given index | ||||||
* @param index position of the bit, zero-indexed | ||||||
*/ | ||||||
has(index: number): boolean { | ||||||
folkvir marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
const wordIndex = Math.floor(index / bitsPerWord) | ||||||
const mask = 1 << (index % bitsPerWord) | ||||||
return (this.array[wordIndex] & mask) !== 0 | ||||||
} | ||||||
|
||||||
/** | ||||||
* Set the bit to true | ||||||
* @param index position of the bit, zero-indexed | ||||||
*/ | ||||||
add(index: number) { | ||||||
const wordIndex = Math.floor(index / bitsPerWord) | ||||||
const mask = 1 << (index % bitsPerWord) | ||||||
this.array[wordIndex] = this.array[wordIndex] | mask | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can index the array just once
Suggested change
|
||||||
} | ||||||
|
||||||
/** | ||||||
* Returns the maximum true bit. | ||||||
*/ | ||||||
max(): number { | ||||||
for (let i = this.array.length - 1; i >= 0; i--) { | ||||||
let bits = this.array[i]; | ||||||
if (bits) { | ||||||
return BitSet.highBit(bits) + (i*bitsPerWord); | ||||||
} | ||||||
} | ||||||
return 0; | ||||||
} | ||||||
|
||||||
/** | ||||||
* Returns the number of true bits. | ||||||
*/ | ||||||
bitCount(): number { | ||||||
let result = 0 | ||||||
for (let i = 0; i < this.array.length; i++) { | ||||||
result += BitSet.countBits(this.array[i]) // Assumes we never have bits set beyond the end | ||||||
} | ||||||
return result | ||||||
folkvir marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
} | ||||||
|
||||||
/** | ||||||
* Returns true if the size and contents are identical. | ||||||
* @param other another BitSet | ||||||
*/ | ||||||
equals(other: BitSet): boolean { | ||||||
if (other.size !== this.size) { | ||||||
return false | ||||||
} | ||||||
for (let i = 0; i < this.array.length; i++) { | ||||||
if (this.array[i] !== other.array[i]) { | ||||||
return false | ||||||
} | ||||||
} | ||||||
return true | ||||||
} | ||||||
|
||||||
/** | ||||||
* Returns a JSON-encodable object readable by {@link import}. | ||||||
*/ | ||||||
export(): { size: number, content: string } { | ||||||
return { | ||||||
size: this.size, | ||||||
content: encode(this.array) | ||||||
} | ||||||
} | ||||||
|
||||||
/** | ||||||
* Returns an object written by {@link export}. | ||||||
* @param data an object written by {@link export} | ||||||
*/ | ||||||
static import(data: any): BitSet { | ||||||
if (typeof data.size !== "number") { | ||||||
throw Error("BitSet missing size") | ||||||
} | ||||||
if (typeof data.content !== "string") { | ||||||
throw Error("BitSet: missing content") | ||||||
} | ||||||
const result = new BitSet(data.size) | ||||||
const buffer = decode(data.content) | ||||||
result.array = new Uint8Array(buffer) | ||||||
return result | ||||||
} | ||||||
|
||||||
/** | ||||||
* Returns the index of the maximum bit in the number, or -1 for 0 | ||||||
* @bits an unsigned 8-bit number | ||||||
* @example | ||||||
* BitSet.highBit(0) // returns -1 | ||||||
* BitSet.highBit(5) // returns 2 | ||||||
*/ | ||||||
private static highBit(bits: number): number { | ||||||
let result = bitsPerWord - 1; | ||||||
let mask = 1 << result; | ||||||
while (result >= 0 && ((mask & bits) !== mask)) { | ||||||
mask >>>= 1; | ||||||
result--; | ||||||
} | ||||||
return result; | ||||||
} | ||||||
|
||||||
/** | ||||||
* Returns the number of true bits in the number | ||||||
* @bits an unsigned 8-bit number | ||||||
* @example | ||||||
* BitSet.countBits(0) // returns 0 | ||||||
* BitSet.countBits(3) // returns 2 | ||||||
*/ | ||||||
private static countBits(bits: number): number { | ||||||
let result = bits & 1; | ||||||
while (bits !== 0) { | ||||||
bits = bits >>> 1; | ||||||
result += (bits & 1) | ||||||
} | ||||||
return result | ||||||
} | ||||||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we let the users encode the arraybuffers only if they need to ?
I understand that dealing with strings may be easier in some cases, but today, the web platform allows dealing with raw binary data in most contexts (network communication, file access, webworker messages, ...), and since the stated goal of this new feature is performance and storage size, maybe it would be wiser to let the user access the raw data (and encode it to some encoding if they need to) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a long-term goal, I think a binary format is a great idea. However, there is not currently a drop-in solution for converting JavaScript objects into binary. BSON does not support raw binary data, and Protobuf requires a fair amount of schema setup. This gives us a big win quickly, and is not an unreasonable format if we wish to support multiple serialization formats. Other open source formats support both binary & text modes for convenience (e.g. STL 3D model files.)
Base64 is such a common format on the web that MDN has sample code in its glossary. The new dependency on base64-arraybuffer is only 8k of code.
I don't want to require that users learn implementation details in order to serialize efficiently. Just ditching the Base64 encoding would force users to search our objects for TypedArrays in order to serialize them compactly, and then figure out how to deserialize them in a memory-efficient manner. They would also have to figure out what sort of TypedArrays we're using, avoid storing an ArrayBuffer-backed TypedArray twice, and deal with endian issues if they find anything other than a UInt8Array.
A binary format is great, but it's a bigger conversation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what you are saying, but I still feel like it should be done the other way: provide the native implementation now (it's less code), and maybe add the base64 serialization as a usability improvement for those who need it later. Providing the base64 directly blocks the people who need the binary representation, but providing the binary representation still lets the ones who need base64 use it if they need to.
@Callidon what do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I side with @dleppik on this one. We want users to be able to use the data structures out of the box, without to handle topics like binary representation. For most basic usages, it's fine.
We could provide another export method/API point to allow users to customize their binary representation, but it's outside of the scope of this PR IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lovasoa, I'm not sure what you mean by the native implementation. TypedArrays break JSON serialization.
JSON.parse(JSON.stringify(new Uint8Array(1)))
yields{ '0': 0 }
, not a Uint8Array. JavaScript has no native binary format, and we need to store the entire Bloom filter, not just a byte array.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you do a fully binary version, there are several libraries you may want to consider. BSON is a solid choice, but it may require some tweaking with TypedArrays. I've had good luck with Protobuf (Protocol Buffers), although it is definitely not plug-and-play; you need to write a .proto type definition which is geared toward high-volume data transfer. I've never tried MessagePack, but it may be the strongest contender since their sample JavaScript code serializes a UInt8Array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe just create the byte array directly ? There are only three integers and a byte array to serialize, it may not be worth integrating yet another library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also metadata to include; that's often what trips up binary formats. I was assuming we'd want to support all the top-level classes, some of which are highly structured, while leveraging the existing reflection classes. That gets error-prone quickly.
You're right that adding another library could increase the footprint dramatically, especially if the classes are directly responsible for serializing themselves. If you're using a 1 MB library to save 0.5 MB of data transfer, it's a net loss. Even if you're using a 1 MB library to save 2 MB of data transfer, it may be a net loss in terms of latency, since loading the library blocks the data load. (I considered that before including base64-arraybuffer.) Anything significantly less than 12kB is a wash due to ethernet frame size.
The cleanest solution is to have the encoding in its own separate package, so that users don't have to download it if they don't want it. That also means one particular format doesn't preclude any other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to include any metadata. We'll have a magic number to identify the format, the few integers we need, then the byte array. We don't need anything more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lovasoa I will not open a PR once this is merged, but I will create an issue to track your suggestion. Thus, if you or someone else want to tackle it, they are free to do it. But for more, this topic is outside the scope of this PT.