PARQUET-41: Add bloom filter for parquet #62
Conversation
Hi @rdblue

@cjjnjust, thanks for getting this PR up. I'll take a look at it as soon as I have some time.
src/main/thrift/parquet.thrift (outdated):

    14: optional i64 bloom_filter_offset;
    }

    enum BloomFilterAlgorithm {
This file does not contain enough information to fully reconstruct the algorithm.
Hi Jim,
When we get the bloom filter offset from the column chunk metadata, we can read the header defined here to parse the HASH and ALGORITHM, and then read BYTES of the bitset. What other information do you think we lack here?
Imagine that your design document was lost and the parquet-mr patch was lost. Shouldn't this patch have enough information to support a complete reconstruction of the algorithm, as Ryan Blue mentioned in the call today?
This contains no information about the salt, about WHICH block-based bloom filter is used, the size of the blocks, the encoding of the data that is hashed, and so on.
Thanks Jim, I added more info about this.
src/main/thrift/parquet.thrift (outdated):

    * lower 32 bits hash values are used to set bits in tiny bloom filter.
    * See “Cache-, Hash- and Space-Efficient Bloom Filters”. Specifically,
    * the tiny bloom filter size is 32 bytes. The algorithm also needs 8 odd
    * SALT values (0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
This does not describe enough about the SALT to be able to fully reconstruct the algorithm.
Force-pushed from b1b8eb1 to d198752.
src/main/thrift/parquet.thrift (outdated):

    * In order to set bits in bucket, the algorithm need 8 SALT values
    * (0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU, 0x705495c7U,
    * 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U) to calculate index with formular:
    * index[i] = (hash * SALT[i]) >> 27
This doesn't quite explain it, since index[i] will be a number between 0 and 31, inclusive, while the tiny bloom filter is 32 bytes, aka 256 bits. Please show that it is a split bloom filter over eight 32-bit words.
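The split bloom filter the reviewer is asking to have spelled out can be sketched as follows. This is illustrative Python, not the parquet-mr code; the SALT constants are the ones quoted in the diff above, and `>> 27` keeps the top 5 bits of each 32-bit product, i.e. a bit index within one 32-bit word of an eight-word (32-byte) block.

```python
# Illustrative sketch of a split block bloom filter: one 32-byte block
# holds eight 32-bit words, and an insert sets exactly one bit per word.

SALT = [0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
        0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31]

def block_mask(key_hash_lo32):
    """For each of the 8 words, pick one bit index in [0, 32) by multiplying
    the low 32 hash bits by an odd SALT constant and keeping the top 5 bits."""
    mask = []
    for salt in SALT:
        # 32-bit multiply with wraparound; >> 27 leaves a 5-bit bit index
        idx = ((key_hash_lo32 * salt) & 0xFFFFFFFF) >> 27
        mask.append(1 << idx)
    return mask

def insert(block, key_hash_lo32):
    for i, m in enumerate(block_mask(key_hash_lo32)):
        block[i] |= m

def might_contain(block, key_hash_lo32):
    return all(block[i] & m for i, m in enumerate(block_mask(key_hash_lo32)))

block = [0] * 8              # one block = eight 32-bit words = 32 bytes
insert(block, 0xdeadbeef)
assert might_contain(block, 0xdeadbeef)
```

In a full filter, the other half of the 64-bit hash would first select which block (bucket) to probe, as the quoted comment describes.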
src/main/thrift/parquet.thrift (outdated):

    1: required i32 numBytes;

    /** The algorithm for setting bits. **/
    2: required BloomFilterAlgorithm bloomFilterAlgorithm;
I think the names (but not the type names) of these fields can be simplified by removing "bloomFilter" from the front.
src/main/thrift/parquet.thrift (outdated):

    * filter, the high 32 bits hash value is used to select bucket, and
    * lower 32 bits hash values are used to set bits in tiny bloom filter.
    * See “Cache-, Hash- and Space-Efficient Bloom Filters”. Specifically,
    * one tiny bloom filter contains eight 32-bit words and the algorithm
Looks good. Also explain the endianness?
The comments on the hash say it takes the plain encoding of the content as input. We don't have to explain endianness here again, since the algorithm itself is endianness-irrelevant.
Maybe explain the endianness of how each 4-byte word is stored?
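For example, if each 4-byte word of a block were stored little-endian (an assumption here, chosen to match Parquet's plain encoding; the spec text under review is what would actually pin this down), serialization would look like:

```python
import struct

def serialize_block(words):
    # eight 32-bit words -> 32 bytes, each word written little-endian
    return b"".join(struct.pack("<I", w) for w in words)

def deserialize_block(data):
    # 32 bytes -> eight 32-bit words
    return list(struct.unpack("<8I", data))

block = [0x00000001, 0x80000000] + [0] * 6
data = serialize_block(block)
assert data[:4] == b"\x01\x00\x00\x00"   # low byte of word 0 comes first
assert deserialize_block(data) == block
```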
src/main/thrift/parquet.thrift (outdated):

    * dictionary encoded for example **/
    13: optional list<PageEncodingStats> encoding_stats;

    /** Byte offset from beginning of file to bloom filter data**/
This should specify where the bloom filter data should be located for each column. I think the conclusion we came to was to put bloom filters for all columns together before the start of the row group the filters describe. That way we can avoid a seek if a bloom filter and row group both need to be read.
    MURMUR3 = 0;
    }

    struct BloomFilterHeader {
Where is this header written? I'm assuming that when you seek to a bloom filter's offset, you should be able to read the header? Then the filter bytes follow the header? I think this should be more clear about where structures are placed in the file.
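The layout the reviewer is asking to pin down could work like this toy sketch, which assumes the header sits at the bloom filter's offset with the bitset bytes immediately after it. A real reader would Thrift-decode BloomFilterHeader; the stand-in "header" here is just a 4-byte little-endian numBytes field:

```python
import io
import struct

def write_filter(buf, bitset):
    """Append a stand-in header (numBytes) followed by the bitset;
    return the offset where the header starts."""
    offset = buf.tell()
    buf.write(struct.pack("<i", len(bitset)))  # stand-in for BloomFilterHeader
    buf.write(bitset)
    return offset

def read_filter(buf, offset):
    """Seek to the filter's offset, read the header, then the bitset."""
    buf.seek(offset)
    (num_bytes,) = struct.unpack("<i", buf.read(4))
    return buf.read(num_bytes)

buf = io.BytesIO()
off = write_filter(buf, b"\xff" * 32)
assert read_filter(buf, off) == b"\xff" * 32
```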
src/main/thrift/parquet.thrift (outdated):

    * Definition for hash function used to compute hash of column value.
    * Note that the hash function take plain encoding of column values as input.
    */
    enum BloomFilterHash {
Unfortunately, enums aren't forward-compatible in thrift, so we can't use them if we intend to add more entries later. That's why logical types are changing to use a union of structs instead: https://github.com/apache/parquet-format/pull/51/files#diff-0f9d1b5347959e15259da7ba8f4b6252R278.
Can you change these to use a union as well?
src/main/thrift/parquet.thrift (outdated):

    /** Murmur3 has 32 bits and 128 bits hash, we use lower 64 bits from
     * Murmur3 128 bits function murmur3hash_x64_128
     **/
    MURMUR3 = 0;
This needs to be more specific. What byte order is used for multi-byte primitive types? I think this needs to cover each primitive type (other than int96) and provide a reference value for each.
It has a statement outside the enumeration that the hash function takes the plain encoding of the value as input. Do we still need to specify byte order? In Parquet, plain encoding uses the little-endian representation of the value.
Yes, because the byte order of the result is not obvious and not obviously the same as the input.
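For instance, under the "hash the plain encoding" rule, fixed-width primitives would be fed to the hash function as their little-endian plain-encoded bytes (Parquet's plain encoding is little-endian). A sketch of the encoding step only, since the hash itself is what the enum selects:

```python
import struct

def plain_encode_int64(v):
    # INT64 plain encoding: 8 bytes, little-endian, two's complement
    return struct.pack("<q", v)

def plain_encode_double(v):
    # DOUBLE plain encoding: 8 bytes, little-endian IEEE 754
    return struct.pack("<d", v)

assert plain_encode_int64(1) == b"\x01" + b"\x00" * 7
assert len(plain_encode_double(1.5)) == 8
```

The reviewer's point is that the byte order of the hash's *output* also needs to be stated, since it need not match the input's.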
    * Definition for hash function used to compute hash of column value.
    * Note that the hash function take plain encoding of column values as input.
    */
    union BloomFilterHash {
This isn't quite what I meant. Have a look at the way it is done in the logical types PR. https://github.com/apache/parquet-format/pull/51/files#diff-0f9d1b5347959e15259da7ba8f4b6252R264
That uses an empty struct in the union.
Do you mean in this way? It looks a bit strange and is hard to understand from the code.
src/main/thrift/parquet.thrift (outdated):

    /**
     * Hash strategy type annotation.
     */
    struct HashStrategy {
This should be called Murmur3, then the union includes a Murmur3 called MURMUR3. I know it's weird, but that's how we're doing it elsewhere.
This PR adds a bloom filter structure to Parquet:
- The bloom filter is built per column and stored at the end of each row group.
- The bloom filter size can be set by the user or calculated automatically.
The bloom filter structure contains a bloom filter header and a bitset. The header specifies the length of the bitset, the main algorithm, and the hash function applied.
Here's the related design doc.
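For the "calculated automatically" case, a common way to size a bloom filter (the standard formula, not necessarily this PR's exact code) derives the bit count from the expected number of distinct values n and the target false-positive probability p:

```python
import math

def optimal_bits(n, p):
    """Standard bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    return int(math.ceil(-n * math.log(p) / (math.log(2) ** 2)))

# roughly 9.6 million bits (~1.2 MB) for 1M distinct values at a 1% FPP
bits = optimal_bits(1_000_000, 0.01)
assert 9_585_000 < bits < 9_586_000
```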