PARQUET-41: Add bloom filter for parquet #62
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -511,6 +511,57 @@ struct ColumnMetaData { | |
| * This information can be used to determine if all data pages are | ||
| * dictionary encoded for example **/ | ||
| 13: optional list<PageEncodingStats> encoding_stats; | ||
|
|
||
| /** Byte offset from beginning of file to bloom filter data. The bloom filter | ||
| * data of all columns is stored together, before the start of the row group it describes. **/ | ||
| 14: optional i64 bloom_filter_offset; | ||
| } | ||
|
|
||
| /** | ||
| * Definition of bloom filter algorithm. | ||
| */ | ||
| union BloomFilterAlgorithm { | ||
| /** The default value 0 represents the block-based bloom filter. | ||
| * The bloom filter bitset is split into tiny buckets, each acting as a tiny bloom | ||
| * filter. The high 32 bits of the hash value are used to select a bucket, and the | ||
| * lower 32 bits are used to set bits in that tiny bloom filter. | ||
| * See “Cache-, Hash- and Space-Efficient Bloom Filters”. Specifically, | ||
| * one tiny bloom filter contains eight 32-bit words (each 4 bytes, stored in | ||
| * little endian), and the algorithm sets one bit in each 32-bit word. | ||
| * | ||
| * In order to set bits in a bucket, the algorithm needs 8 SALT values | ||
| * (0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU, 0x705495c7U, | ||
| * 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U) to calculate each index with the formula: | ||
| * index[i] = (hash * SALT[i]) >> 27 | ||
| **/ | ||
| 1: i32 algorithm = 0; | ||
| } | ||
|
|
||
| /** | ||
| * Definition for hash function used to compute hash of column value. | ||
| * Note that the hash function takes the plain encoding of column values as input. | ||
| */ | ||
| union BloomFilterHash { | ||
|
Contributor
This isn't quite what I meant. Have a look at the way it is done in the logical types PR. https://github.com/apache/parquet-format/pull/51/files#diff-0f9d1b5347959e15259da7ba8f4b6252R264 That uses an empty struct in the union.
Contributor
Author
Do you mean in this way? It looks a bit strange and hard to understand from the code. |
||
| /** The default value 0 represents Murmur3. | ||
| * Murmur3 has 32-bit and 128-bit hash variants; we use the least significant | ||
| * 64 bits from its x64 128-bit function murmur3hash_x64_128. | ||
| **/ | ||
| 1: i32 hash_strategy = 0; | ||
| } | ||
|
|
||
| /** | ||
| * The bloom filter header is stored at the beginning of each column's bloom filter data | ||
| * and is followed by its bitset. | ||
| */ | ||
| struct BloomFilterHeader { | ||
|
Contributor
Where is this header written? I'm assuming that when you seek to a bloom filter's offset, you should be able to read the header? Then the filter bytes follow the header? I think this should be more clear about where structures are placed in the file. |
||
| /** The size of the bitset in bytes; must be a power of 2. **/ | ||
| 1: required i32 numBytes; | ||
|
|
||
| /** The algorithm for setting bits. **/ | ||
| 2: required BloomFilterAlgorithm algorithm; | ||
|
|
||
| /** The hash function used for bloom filter. **/ | ||
| 3: required BloomFilterHash hash; | ||
| } | ||
|
|
||
| struct ColumnChunk { | ||
|
|
||
This doesn't quite explain it, since index[i] will be a number between 0 and 31, inclusive, while the tiny bloom filter is 32 bytes, aka 256 bits. Please show that it is a split bloom filter over eight 32-bit words.
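To make the split structure concrete, here is a minimal Python sketch of how one 64-bit hash could touch exactly one bit in each of the eight 32-bit words of a bucket. It is an illustration under stated assumptions, not the PR's implementation: the function names are hypothetical, the product `hash * SALT[i]` is assumed to wrap as an unsigned 32-bit value (which is what makes `>> 27` land in 0..31), and the bucket is assumed to be chosen by masking the high 32 bits with `num_buckets - 1`, which relies on the bitset size being a power of 2.

```python
# SALT values copied from the BloomFilterAlgorithm comment in the diff.
SALT = (0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
        0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31)
WORDS_PER_BUCKET = 8      # eight 32-bit words per tiny bloom filter
BYTES_PER_BUCKET = 32     # 8 words * 4 bytes = 256 bits


def bucket_masks(hash64):
    """Return eight 32-bit masks, one per word, each with exactly one bit set."""
    key = hash64 & 0xFFFFFFFF  # lower 32 bits of the hash
    # Assumption: the multiplication wraps to 32 bits, so the shift yields 0..31.
    return [1 << (((key * salt) & 0xFFFFFFFF) >> 27) for salt in SALT]


def bucket_index(hash64, num_bytes):
    """Select a bucket from the high 32 bits (assumed: mask by num_buckets - 1)."""
    num_buckets = num_bytes // BYTES_PER_BUCKET
    return (hash64 >> 32) & (num_buckets - 1)


def insert(words, hash64):
    """Set one bit in each of the eight words of the selected bucket."""
    base = bucket_index(hash64, len(words) * 4) * WORDS_PER_BUCKET
    for i, mask in enumerate(bucket_masks(hash64)):
        words[base + i] |= mask


def might_contain(words, hash64):
    """Return True only if all eight bits for this hash are set."""
    base = bucket_index(hash64, len(words) * 4) * WORDS_PER_BUCKET
    return all(words[base + i] & mask
               for i, mask in enumerate(bucket_masks(hash64)))


# Usage: a 1024-byte bitset held as 256 words (32 buckets).
words = [0] * (1024 // 4)
insert(words, 0x0123456789ABCDEF)
```

Here `index[i]` only addresses a bit within word `i`, so the eight indices together address 8 of the 256 bits in the bucket, one per word. On disk each word would be serialized as 4 little-endian bytes, per the comment in the diff.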