PARQUET-41: Add bloom filter for parquet #62
```diff
@@ -511,6 +511,33 @@ struct ColumnMetaData {
   * This information can be used to determine if all data pages are
   * dictionary encoded for example **/
  13: optional list<PageEncodingStats> encoding_stats;

  /** Byte offset from beginning of file to bloom filter data**/
  14: optional i64 bloom_filter_offset;
}

enum BloomFilterAlgorithm {
  /** Block based bloom filter. **/
  BLOCK = 0;
}

enum BloomFilterHash {
  /** The hash function used to compute hash of column value,
   * murmur3 has 32 bits and 128 bits hash, we use lower 64 bits from
   * murmur3 128 bits function murmur3hash_x64_128
   **/
  MURMUR3 = 0;

}

struct BloomFilterHeader {
```
Contributor

Where is this header written? I'm assuming that when you seek to a bloom filter's offset, you should be able to read the header? Then the filter bytes follow the header? I think this should be more clear about where structures are placed in the file.
```diff
  /** The size of bitset in bytes, must be a power of 2 and larger than 32**/
  1: required i32 numBytes;

  /** The algorithm for setting bits. **/
  2: required BloomFilterAlgorithm bloomFilterAlgorithm;

  /** The hash function used for bloom filter. **/
  3: required BloomFilterHash bloomFilterHash;
}

struct ColumnChunk {
```
This should specify where the bloom filter data should be located for each column. I think the conclusion we came to was to put bloom filters for all columns together before the start of the row group the filters describe. That way we can avoid a seek if a bloom filter and row group both need to be read.
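
To make the layout discussed in these comments concrete, here is a minimal Python sketch. It is an illustration under assumptions, not what the PR specifies: the names `PendingFilter`, `is_valid_num_bytes`, and `plan_bloom_filter_layout` are hypothetical, the `header` bytes stand in for a serialized `BloomFilterHeader`, and it assumes each filter is written as its header immediately followed by `numBytes` of bitset, with all of a row group's filters placed contiguously just before that row group.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PendingFilter:
    """One column's bloom filter waiting to be written (hypothetical helper type)."""
    column: str
    header: bytes  # stand-in for a serialized BloomFilterHeader (assumption)
    bitset: bytes  # the filter data, numBytes long


def is_valid_num_bytes(num_bytes: int) -> bool:
    # Constraint quoted from the header comment: a power of 2 and larger than 32.
    return num_bytes > 32 and (num_bytes & (num_bytes - 1)) == 0


def plan_bloom_filter_layout(
    start: int, filters: List[PendingFilter]
) -> Tuple[Dict[str, int], int]:
    """Assign a bloom_filter_offset to each column.

    Filters are laid out back to back starting at `start`; the returned
    second value is the offset at which the row group itself would begin.
    """
    offsets: Dict[str, int] = {}
    pos = start
    for f in filters:
        if not is_valid_num_bytes(len(f.bitset)):
            raise ValueError(f"invalid bitset size for column {f.column!r}")
        offsets[f.column] = pos              # header is readable at this offset
        pos += len(f.header) + len(f.bitset)  # bitset follows the header
    return offsets, pos


if __name__ == "__main__":
    pending = [
        PendingFilter("a", header=b"\x00" * 10, bitset=b"\x00" * 64),
        PendingFilter("b", header=b"\x00" * 10, bitset=b"\x00" * 128),
    ]
    column_offsets, row_group_start = plan_bloom_filter_layout(4, pending)
    print(column_offsets, row_group_start)
```

Under these assumptions, a reader that needs both the filters and the row group can fetch one contiguous byte range starting at the first filter's offset, which is the seek-avoidance argument made in the comment above.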