Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
120 changes: 120 additions & 0 deletions BloomFilter.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
<!--
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
-->

Parquet Bloom Filter
===
### Problem statement
In their current format, column statistics and dictionaries can be used for predicate
pushdown. Statistics include minimum and maximum value, which can be used to filter out
values not in the range. Dictionaries are more specific, and readers can filter out values
that are between min and max but not in the dictionary. However, when there are too many
distinct values, writers sometimes choose not to add dictionaries because of the extra
space they occupy. This leaves columns with large cardinalities and widely separated min
and max without support for predicate pushdown.

A Bloom filter[1] is a compact data structure that overapproximates a set. It can respond
to membership queries with either "definitely no" or "probably yes", where the probability
of false positives is configured when the filter is initialized. Bloom filters do not have
false negatives.

Because Bloom filters are small compared to dictionaries, they can be used for predicate
pushdown even in columns with high cardinality and when space is at a premium.

### Goal
* Enable predicate pushdown for high-cardinality columns while using less space than
dictionaries.

* Induce no additional I/O overhead when executing queries on columns without Bloom
filters attached or when executing non-selective queries.

### Technical Approach
The initial Bloom filter algorithm in Parquet is implemented using a combination of two
Bloom filter techniques.

First, the block Bloom filter algorithm from Putze et al.'s "Cache-, Hash- and
Space-Efficient Bloom filters"[2] is used. This divides a filter into many tiny Bloom
filters, each one of which is called a "block". In Parquet's initial implementation, each
block is 256 bits. When inserting or finding a value, part of the hash of that value is
used to index into the array of blocks and pick a single one. This single block is then
used for the remaining part of the operation.

Second, within each block, this implementation uses the folklore split Bloom filter
technique, as described in section 2.1 of "Network Applications of Bloom Filters: A
Survey"[5]. This divides the 256 bits in each block up into eight contiguous 32-bit lanes
and sets or checks one bit in each lane.

#### Algorithm
In the initial algorithm, the most significant 32 bits from the hash value are used as the
index to select a block from bitset. The lower 32 bits of the hash value, along with eight
constant salt values, are used to compute the bit to set in each lane of the block. The
salt and lower 32 bits are combined using the multiply-shift[3] hash function:

```c
// 8 SALT values used to compute bit pattern
static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// key: the lower 32 bits of hash result
// mask: the output bit pattern for a tiny Bloom filter
void Mask(uint32_t key, uint32_t mask[8]) {
for (int i = 0; i < 8; ++i) {
mask[i] = key * SALT[i];
}
for (int i = 0; i < 8; ++i) {
mask[i] = mask[i] >> 27;
}
for (int i = 0; i < 8; ++i) {
mask[i] = UINT32_C(1) << mask[i];
}
}

```

#### Hash Function
The function used to hash values in the initial implementation is MurmurHash3[4], using
the least-significant 64 bits of the 128-bit version of the function on the x86-64
platform. Note that the function produces different values on different architectures, so
implementors must be careful to use the version specific to x86-64. That function can be
emulated on different platforms without difficulty.

#### Build a Bloom filter
The fact that exactly eight bits are checked during each lookup means that these filters
are most space efficient when used with an expected false positive rate of about
0.5%. This is achieved when there are about 11.54 bits for every distinct value inserted
into the filter.

To calculate the size the filter should be for another false positive rate `p`, use the
following formula. The output is in bits per distinct element:

```c
-8 / log(1 - pow(p, 1.0 / 8));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me to the source of this formula? I tried substituting 0.5% into this and it did not come out to 11.54. I probably missed something.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the classic formula for Bloom Filters. See the Network Applications paper at the bottom for a proof.

It's very sensitive to p, so 0.4% is closer, and 0.39% closer still.

```

#### File Format
The Bloom filter data of a column is stored at the beginning of its column chunk in the
row group. The column chunk metadata contains the Bloom filter offset. The Bloom filter is
stored with a header containing the size of the filter in bytes, the algorithm, and the
hash function.

### Reference
1. [Bloom filter introduction at Wiki](https://en.wikipedia.org/wiki/Bloom_filter)
2. [Cache-, Hash- and Space-Efficient Bloom Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)
3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
4. [Murmur Hash at Wiki](https://en.wikipedia.org/wiki/MurmurHash)
5. [Network Applications of Bloom Filters: A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
37 changes: 37 additions & 0 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -475,6 +475,7 @@ enum PageType {
INDEX_PAGE = 1;
DICTIONARY_PAGE = 2;
DATA_PAGE_V2 = 3;
BLOOM_FILTER_PAGE = 4;
}

/**
Expand Down Expand Up @@ -554,6 +555,38 @@ struct DataPageHeaderV2 {
8: optional Statistics statistics;
}

/** Block-based algorithm type annotation. **/
struct SplitBlockAlgorithm {}
/** The algorithm used in Bloom filter. **/
union BloomFilterAlgorithm {
/** Block-based Bloom filter. **/
1: SplitBlockAlgorithm BLOCK;
}
/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the original SMHasher
* repo by Austin Appleby.
**/
struct Murmur3 {}
/**
* The hash function used in Bloom filter. This function takes the hash of a column value
* using plain encoding.
**/
union BloomFilterHash {
/** Murmur3 Hash Strategy. **/
1: Murmur3 MURMUR3;
}
/**
* Bloom filter header is stored at beginning of Bloom filter data of each column
* and followed by its bitset.
**/
struct BloomFilterPageHeader {
/** The size of bitset in bytes **/
1: required i32 numBytes;
/** The algorithm for setting bits. **/
2: required BloomFilterAlgorithm algorithm;
/** The hash function used for Bloom filter. **/
3: required BloomFilterHash hash;
}

struct PageHeader {
/** the type of the page: indicates which of the *_header fields is set **/
1: required PageType type
Expand All @@ -574,6 +607,7 @@ struct PageHeader {
6: optional IndexPageHeader index_page_header;
7: optional DictionaryPageHeader dictionary_page_header;
8: optional DataPageHeaderV2 data_page_header_v2;
9: optional BloomFilterPageHeader bloom_filter_page_header;
}

/**
Expand Down Expand Up @@ -660,6 +694,9 @@ struct ColumnMetaData {
* This information can be used to determine if all data pages are
* dictionary encoded for example **/
13: optional list<PageEncodingStats> encoding_stats;

/** Byte offset from beginning of file to Bloom filter data. **/
14: optional i64 bloom_filter_offset;
}

struct ColumnChunk {
Expand Down