#29
folkvir committed Dec 30, 2021
1 parent 3d4d57e commit ed02f08
Showing 3 changed files with 77 additions and 11 deletions.
36 changes: 35 additions & 1 deletion README.md
@@ -22,6 +22,7 @@ JavaScript/TypeScript implementation of probabilistic data structures: Bloom Fil
- [MinHash](#minhash)
- [Top-K](#top-k)
- [Invertible Bloom Filters](#invertible-bloom-filters)
- [XOR-Filter](#xor-filter)
- [Export and import](#export-and-import)
- [Seeding and Hashing](#seeding-and-hashing)
- [Documentation](#documentation)
@@ -460,6 +461,38 @@ filter = InvertibleBloomFilter.from(items, errorRate)

**Tuning the IBLT** We recommend using a **hashcount** of at least 3 and an **alpha** of 1.5 for at least 50 differences, which amounts to 1.5\*50 = 75 cells. If you then insert a huge number of values, decoding will still work (for any number of differences below 50), but testing the presence of a value remains probabilistic and depends on the number of elements inserted (even for functions like `listEntries`). For more details, read the seminal research paper on IBLTs ([full-text article](http://www.sysnet.ucsd.edu/sysnet/miscpapers/EppGooUye-SIGCOMM-11.pdf)).
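
The sizing rule above is plain arithmetic and can be sketched as follows. `ibltCells` is a hypothetical helper written for this illustration; it is not part of the package API.

```javascript
// Hypothetical helper (not part of bloom-filters): number of IBLT cells
// needed for an expected number of differences and a given alpha.
function ibltCells(differences, alpha = 1.5) {
  return Math.ceil(alpha * differences)
}

const cells = ibltCells(50) // 1.5 * 50 = 75
```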

### XOR Filter

**Available with 8-bit and 16-bit fingerprint lengths**

An XOR Filter is a probabilistic data structure that is more space-efficient than a Bloom Filter.
It is most useful for storing read-only sets compactly.

**Reference:** Graf, Thomas Mueller, and Daniel Lemire. "Xor filters: Faster and smaller than bloom and cuckoo filters." Journal of Experimental Algorithmics (JEA) 25 (2020): 1-16.
([Full text article](https://arxiv.org/abs/1912.08258))

#### Methods

- `add(elements: HashableInput[]) -> void`: Add elements to the filter. Calling this method more than once overrides the current filter with the new elements.
- `has(element: HashableInput) -> boolean`: Returns `true` if the element may be in the set (false positives are possible) and `false` otherwise.

```javascript
const {XorFilter} = require('bloom-filters')
const xor8 = new XorFilter(1)
xor8.add(['a'])
xor8.has('a') // true
xor8.has('b') // false
// or create and fill the filter in one step
const filter = XorFilter.create(['a'])
filter.has('a') // true
// using a 16-bit fingerprint length
XorFilter.create(['a'], 16).has('a') // true
const a = new XorFilter(1, 16)
a.add(['a'])
a.has('a') // true
```
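
To make the idea behind `has` more concrete, here is a minimal, self-contained sketch of the xor-filter scheme described in the Graf and Lemire paper: each key maps to three slots, and the table is built so that the XOR of those three slots equals the key's fingerprint. This is an illustration only, not the implementation shipped in this package. The names `hash32`, `slots`, `fingerprint`, `buildXor8` and `hasXor8` are hypothetical, the hash function is an arbitrary stand-in, and the sketch assumes distinct string keys.

```javascript
// Minimal xor-filter sketch with 8-bit fingerprints.
// Illustrative only: NOT the code used by bloom-filters. All names are hypothetical.

// Seeded 32-bit string hash (FNV-1a style loop followed by a murmur3-style finalizer).
function hash32(str, seed) {
  let h = (0x811c9dc5 ^ seed) >>> 0
  for (let i = 0; i < str.length; i++) {
    h = Math.imul(h ^ str.charCodeAt(i), 0x01000193) >>> 0
  }
  h = Math.imul(h ^ (h >>> 16), 0x85ebca6b) >>> 0
  h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35) >>> 0
  return (h ^ (h >>> 16)) >>> 0
}

// One slot in each third of the table, so the three slots of a key never coincide.
function slots(key, seed, blockLength) {
  return [0, 1, 2].map(i => i * blockLength + (hash32(key, 4 * seed + i) % blockLength))
}

function fingerprint(key, seed) {
  return hash32(key, 4 * seed + 3) & 0xff
}

// Build the table by "peeling"; keys must be distinct strings.
function buildXor8(keys) {
  const blockLength = Math.floor((1.23 * keys.length + 32) / 3) + 1
  const size = 3 * blockLength
  for (let seed = 1; seed <= 64; seed++) {
    const count = new Array(size).fill(0) // number of keys mapped to each slot
    const xorIdx = new Array(size).fill(0) // XOR of the indexes of those keys
    keys.forEach((key, k) => {
      for (const p of slots(key, seed, blockLength)) {
        count[p]++
        xorIdx[p] ^= k
      }
    })
    // Repeatedly remove a key that is alone in some slot.
    const queue = []
    for (let p = 0; p < size; p++) if (count[p] === 1) queue.push(p)
    const order = []
    while (queue.length > 0) {
      const p = queue.pop()
      if (count[p] !== 1) continue // slot was emptied in the meantime
      const k = xorIdx[p]
      order.push([k, p])
      for (const q of slots(keys[k], seed, blockLength)) {
        count[q]--
        xorIdx[q] ^= k
        if (count[q] === 1) queue.push(q)
      }
    }
    if (order.length !== keys.length) continue // peeling failed: retry with a new seed
    // Assign slots in reverse peeling order so that, for every key,
    // table[h0] ^ table[h1] ^ table[h2] equals its fingerprint.
    const table = new Uint8Array(size)
    for (let i = order.length - 1; i >= 0; i--) {
      const [k, p] = order[i]
      const [a, b, c] = slots(keys[k], seed, blockLength)
      table[p] = fingerprint(keys[k], seed) ^ table[a] ^ table[b] ^ table[c]
    }
    return {table, seed, blockLength}
  }
  throw new Error('construction failed (duplicate keys?)')
}

function hasXor8(filter, key) {
  const [a, b, c] = slots(key, filter.seed, filter.blockLength)
  return (filter.table[a] ^ filter.table[b] ^ filter.table[c]) === fingerprint(key, filter.seed)
}

const members = ['alice', 'bob', 'carol', 'dave']
const sketch = buildXor8(members)
const allFound = members.every(key => hasXor8(sketch, key)) // inserted keys are always found
```

Because the table is derived from the whole key set at once, adding a key later means rebuilding it, which is consistent with the "fixed sets only" note above. A lookup for a key that was never inserted returns `true` only when its 8-bit fingerprint happens to match, so false positives are possible but rare.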


## Export and import

All data structures exposed by this package can be **exported and imported to/from JSON**:
@@ -553,12 +586,13 @@ When submitting pull requests please follow the following guidance:
- [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf): Philippe Flajolet, Éric Fusy, Olivier Gandouet and Frédéric Meunier (2007). _"Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm"_. Discrete Mathematics and Theoretical Computer Science Proceedings.
- [MinHash](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.779&rep=rep1&type=pdf): Andrei Z. Broder, _"On the resemblance and containment of documents"_, in Compression and Complexity of Sequences: Proceedings (1997).
- [Invertible Bloom Filters](http://www.sysnet.ucsd.edu/sysnet/miscpapers/EppGooUye-SIGCOMM-11.pdf): Eppstein, D., Goodrich, M. T., Uyeda, F., & Varghese, G. (2011). _What's the difference?: efficient set reconciliation without prior context._ ACM SIGCOMM Computer Communication Review, 41(4), 218-229.
- [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258) Thomas Mueller Graf, Daniel Lemire, Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122

## Changelog

| **Version** | **Release date** | **Major changes** |
| ----------- | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `v2.0.0` | 8/12/2021 | Use correctly double hashing [#issue43](https://github.com/Callidon/bloom-filters/issues/43), rename `getIndices` to `getIndexes` and `getDistinctIndices` to `getDistinctIndexes`. Use `getIndexes` every where except for IBLTs where `getDistinctIndexes` is used. Add support for switching from/to 32/64 bits hash function. Add [#PR44](https://github.com/Callidon/bloom-filters/pull/44) optimizing the BloomFilter internal storage with Uint arrays. Disable 10.x node tests. |
| `v2.0.0` | 12/2021 | Use correctly double hashing [#issue43](https://github.com/Callidon/bloom-filters/issues/43), rename `getIndices` to `getIndexes` and `getDistinctIndices` to `getDistinctIndexes`. Use `getIndexes` every where except for IBLTs where `getDistinctIndexes` is used. Add [#PR44](https://github.com/Callidon/bloom-filters/pull/44) optimizing the BloomFilter internal storage with Uint arrays. Disable 10.x node tests. Add XorFilter [#29](https://github.com/Callidon/bloom-filters/issues/29) |
| `v1.3.0` | 10/04/2020 | Added the MinHash set |
| `v1.2.0` | 08/04/2020 | Add the TopK class |
| `v1.1.0` | 03/04/2020 | Add the HyperLogLog sketch |
20 changes: 16 additions & 4 deletions src/base-filter.ts
@@ -27,18 +27,30 @@ SOFTWARE.
import * as utils from './utils'
import seedrandom from 'seedrandom'

/**
* Exported prng type because seedrandom does not export it
* Original type can be found in: @types/seedrandom
*/
export interface prng {
(): number
double(): number
int32(): number
quick(): number
state(): seedrandom.State
}

/**
* A base class for implementing probabilistic filters
* @author Thomas Minier
* @author Arnaud Grall
*/
export default abstract class BaseFilter {
private _seed: number
private _rng: any
private _rng: prng

constructor() {
this._seed = utils.getDefaultSeed()
this._rng = seedrandom(`${this._seed}`)
this._rng = seedrandom(`${this._seed}`) as prng
}

/**
@@ -54,14 +66,14 @@ export default abstract class BaseFilter {
*/
set seed(seed: number) {
this._seed = seed
this._rng = seedrandom(`${this._seed}`)
this._rng = seedrandom(`${this._seed}`) as prng
}

/**
* Get the function used to draw random numbers
* @return The seeded generator used to draw random numbers
*/
get random(): () => number {
get random(): prng {
return this._rng
}
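
The seed handling in this hunk can be illustrated with a small self-contained sketch: assigning a seed recreates the generator, so the same seed always replays the same sequence. The sketch below swaps in `mulberry32`, a small public-domain PRNG, as a stand-in for `seedrandom`, and `SeededBase` is a hypothetical class that mirrors only the getter/setter shape shown above. The default seed of 1 is also an assumption made for the example.

```javascript
// Stand-in seeded PRNG (mulberry32). The real class uses the seedrandom package.
function mulberry32(seed) {
  let a = seed >>> 0
  return function () {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296 // float in [0, 1)
  }
}

// Hypothetical class mirroring the seed/random accessors of BaseFilter.
class SeededBase {
  constructor(seed = 1) { // default seed chosen for this example only
    this._seed = seed
    this._rng = mulberry32(seed)
  }
  get seed() {
    return this._seed
  }
  set seed(seed) {
    // Re-seeding replaces the generator, restarting the sequence.
    this._seed = seed
    this._rng = mulberry32(seed)
  }
  get random() {
    return this._rng
  }
}

const base = new SeededBase()
base.seed = 42
const firstRun = [base.random(), base.random()]
base.seed = 42
const secondRun = [base.random(), base.random()] // same values as firstRun
```

Typing the generator (as the `prng` interface does in this commit) rather than leaving it as `any` lets callers rely on members such as `int32()` with compiler checking.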

32 changes: 26 additions & 6 deletions src/bloom/xor-filter.ts
@@ -37,8 +37,20 @@ CONSTANTS.set(16, 0xffff)

/**
* XOR-Filter for 8-bits and 16-bits fingerprint length.
*
* To use for fixed sets of elements only
* Inspired from @see https://github.com/FastFilter/fastfilter_java
* @example
* ```js
* const xor8 = new XorFilter(1) // default fingerprint of 8 bits
* xor8.add(['a'])
* xor8.has('a') // true
* xor8.has('b') // false
* const xor16 = new XorFilter(1, 16)
* xor16.add(['a'])
* xor16.has('a') // true
* xor16.has('b') // false
* ```
*/
@AutoExportable<XorFilter>('XorFilter', ['_seed'])
export default class XorFilter extends BaseFilter {
@@ -50,14 +62,17 @@ export default class XorFilter extends BaseFilter {
/**
* Buffer array of fingerprints
*/
@Field<Buffer[]>(d => d.map(encode), d => d.map((e: string) => Buffer.from(decode(e))))
@Field<Buffer[]>(
d => d.map(encode),
d => d.map((e: string) => Buffer.from(decode(e)))
)
private _filter: Buffer[]

/**
* Number of bits per fingerprint
*/
@Field()
private _bits: number = 8
private _bits = 8

/**
* Number of elements inserted in the filter
@@ -109,7 +124,10 @@ export default class XorFilter extends BaseFilter {
* @returns
*/
public has(element: HashableInput): boolean {
const hash = this._hash64(this._hashable_to_long(element, this.seed), this.seed)
const hash = this._hash64(
this._hashable_to_long(element, this.seed),
this.seed
)
const fingerprint = this._fingerprint(hash).toInt()
const r0 = Long.fromInt(hash.toInt())
const r1 = Long.fromInt(hash.rotl(21).toInt())
@@ -129,13 +147,15 @@ export default class XorFilter extends BaseFilter {
* Warning: Another call will override the previously created filter.
* @param elements
* @example
* ```js
* const xor = new XorFilter(1, 8)
* xor.add(['alice'])
* xor.has('alice') // true
* xor.has('bob') // false
* ```
*/
add(elements: HashableInput[]) {
if (elements.length != this._size) {
if (elements.length !== this._size) {
throw new Error(
`This filter has been created for exactly ${this._size} elements`
)
@@ -164,7 +184,7 @@ export default class XorFilter extends BaseFilter {
}
// now check each entry of the filter
let broken = true
let i = 0
let i = 0
while (broken && i < this._filter.length) {
if (!filter._filter[i].equals(this._filter[i])) {
broken = false
@@ -413,7 +433,7 @@ export default class XorFilter extends BaseFilter {
}
}
// the value is in 32 bits format, so we must cast it to the desired number of bytes
let buf = Buffer.from(allocateArray(4, 0))
const buf = Buffer.from(allocateArray(4, 0))
buf.writeInt32LE(xor)
this._filter[change] = buf.slice(0, this._bits / 8)
}
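
The truncation step at the end of this hunk can be tried in isolation. The sketch below writes a 32-bit value in little-endian order and keeps only the first `bits / 8` bytes, which are the least significant ones. `toFingerprint` is a hypothetical helper written for this illustration, and it uses `subarray` where the source uses `slice`; both return a view on the same bytes.

```javascript
// Hypothetical helper: keep the low `bits` bits of a 32-bit value as a Buffer.
function toFingerprint(value, bits) {
  const buf = Buffer.alloc(4)
  buf.writeInt32LE(value | 0) // coerce to a signed 32-bit integer before writing
  return buf.subarray(0, bits / 8) // little-endian: the first bytes are the lowest ones
}

const fp8 = toFingerprint(0x12345678, 8).toString('hex') // '78'
const fp16 = toFingerprint(0x12345678, 16).toString('hex') // '7856'
```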
