PARQUET-41: Add bloom filter for parquet #62
Conversation
Hi @rdblue

@cjjnjust, thanks for getting this PR up. I'll take a look at it as soon as I have some time.
src/main/thrift/parquet.thrift (outdated):

    14: optional i64 bloom_filter_offset;
    }

    enum BloomFilterAlgorithm {
This file does not contain enough information to fully reconstruct the algorithm.
Hi Jim,
When we get the bloom filter offset from the column chunk metadata, we can read the header defined here to parse the HASH and ALGORITHM, and then read BYTES of the bitset. What other information do you think we lack here?
Imagine that your design document was lost and the parquet-mr patch was lost. Shouldn't this patch have enough information to support a complete reconstruction of the algorithm, as Ryan Blue mentioned in the call today?
This contains no information about the salt, about WHICH block-based bloom filter is used, the size of the blocks, the encoding of the data that is hashed, and so on.
Thanks Jim, I added more info about this.
src/main/thrift/parquet.thrift (outdated):

    * lower 32 bits hash values are used to set bits in tiny bloom filter.
    * See “Cache-, Hash- and Space-Efficient Bloom Filters”. Specifically,
    * the tiny bloom filter size is 32 bytes. The algorithm also needs 8 odd
    * SALT values (0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
This does not describe enough about the SALT to be able to fully reconstruct the algorithm.
Force-pushed from b1b8eb1 to d198752.
src/main/thrift/parquet.thrift (outdated):

    * In order to set bits in bucket, the algorithm need 8 SALT values
    * (0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU, 0x705495c7U,
    * 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U) to calculate index with formular:
    * index[i] = (hash * SALT[i]) >> 27
This doesn't quite explain it, since index[i] will be a number between 0 and 31, inclusive, while the tiny bloom filter is 32 bytes, aka 256 bits. Please show that it is a split bloom filter over eight 32-bit words.
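The split bloom filter the reviewer is asking to have spelled out can be sketched as follows. This is illustrative Python, not the parquet-mr code; the SALT constants are the ones quoted in the diff above, and `>> 27` keeps the top 5 bits of each 32-bit product, i.e. a bit index within one 32-bit word of an eight-word (32-byte) block.

```python
# Illustrative sketch of a split block bloom filter: one 32-byte block
# holds eight 32-bit words, and an insert sets exactly one bit per word.

SALT = [0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
        0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31]

def block_mask(key_hash_lo32):
    """For each of the 8 words, pick one bit index in [0, 32) by multiplying
    the low 32 hash bits by an odd SALT constant and keeping the top 5 bits."""
    mask = []
    for salt in SALT:
        # 32-bit multiply with wraparound; >> 27 leaves a 5-bit bit index
        idx = ((key_hash_lo32 * salt) & 0xFFFFFFFF) >> 27
        mask.append(1 << idx)
    return mask

def insert(block, key_hash_lo32):
    for i, m in enumerate(block_mask(key_hash_lo32)):
        block[i] |= m

def might_contain(block, key_hash_lo32):
    return all(block[i] & m for i, m in enumerate(block_mask(key_hash_lo32)))

block = [0] * 8              # one block = eight 32-bit words = 32 bytes
insert(block, 0xdeadbeef)
assert might_contain(block, 0xdeadbeef)
```

In a full filter, the other half of the 64-bit hash would first select which block (bucket) to probe, as the quoted comment describes.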
src/main/thrift/parquet.thrift (outdated):

    1: required i32 numBytes;

    /** The algorithm for setting bits. **/
    2: required BloomFilterAlgorithm bloomFilterAlgorithm;
I think the names (but not the type names) of these fields can be simplified by removing "bloomFilter" from the front.
src/main/thrift/parquet.thrift (outdated):

    * filter, the high 32 bits hash value is used to select bucket, and
    * lower 32 bits hash values are used to set bits in tiny bloom filter.
    * See “Cache-, Hash- and Space-Efficient Bloom Filters”. Specifically,
    * one tiny bloom filter contains eight 32-bit words and the algorithm
Looks good. Also explain the endianness?
The comments on the hash say it takes the plain encoding of the content as input. We don't have to explain endianness here again, since the algorithm itself is endianness-irrelevant.
Maybe explain the endianness of how each 4-byte word is stored?
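For example, if each 4-byte word of a block were stored little-endian (an assumption here, chosen to match Parquet's plain encoding; the spec text under review is what would actually pin this down), serialization would look like:

```python
import struct

def serialize_block(words):
    # eight 32-bit words -> 32 bytes, each word written little-endian
    return b"".join(struct.pack("<I", w) for w in words)

def deserialize_block(data):
    # 32 bytes -> eight 32-bit words
    return list(struct.unpack("<8I", data))

block = [0x00000001, 0x80000000] + [0] * 6
data = serialize_block(block)
assert data[:4] == b"\x01\x00\x00\x00"   # low byte of word 0 comes first
assert deserialize_block(data) == block
```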
src/main/thrift/parquet.thrift (outdated):

    * dictionary encoded for example **/
    13: optional list<PageEncodingStats> encoding_stats;

    /** Byte offset from beginning of file to bloom filter data**/
This should specify where the bloom filter data should be located for each column. I think the conclusion we came to was to put bloom filters for all columns together before the start of the row group the filters describe. That way we can avoid a seek if a bloom filter and row group both need to be read.
    MURMUR3 = 0;
    }

    struct BloomFilterHeader {
Where is this header written? I'm assuming that when you seek to a bloom filter's offset, you should be able to read the header? Then the filter bytes follow the header? I think this should be more clear about where structures are placed in the file.
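The layout the reviewer is asking to pin down could work like this toy sketch, which assumes the header sits at the bloom filter's offset with the bitset bytes immediately after it. A real reader would Thrift-decode BloomFilterHeader; the stand-in "header" here is just a 4-byte little-endian numBytes field:

```python
import io
import struct

def write_filter(buf, bitset):
    """Append a stand-in header (numBytes) followed by the bitset;
    return the offset where the header starts."""
    offset = buf.tell()
    buf.write(struct.pack("<i", len(bitset)))  # stand-in for BloomFilterHeader
    buf.write(bitset)
    return offset

def read_filter(buf, offset):
    """Seek to the filter's offset, read the header, then the bitset."""
    buf.seek(offset)
    (num_bytes,) = struct.unpack("<i", buf.read(4))
    return buf.read(num_bytes)

buf = io.BytesIO()
off = write_filter(buf, b"\xff" * 32)
assert read_filter(buf, off) == b"\xff" * 32
```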
src/main/thrift/parquet.thrift (outdated):

    * Definition for hash function used to compute hash of column value.
    * Note that the hash function take plain encoding of column values as input.
    */
    enum BloomFilterHash {
Unfortunately, enums aren't forward-compatible in thrift, so we can't use them if we intend to add more entries later. That's why logical types are changing to use a union of structs instead: https://github.com/apache/parquet-format/pull/51/files#diff-0f9d1b5347959e15259da7ba8f4b6252R278.
Can you change these to use a union as well?
src/main/thrift/parquet.thrift (outdated):

    /** Murmur3 has 32 bits and 128 bits hash, we use lower 64 bits from
     * Murmur3 128 bits function murmur3hash_x64_128
     **/
    MURMUR3 = 0;
This needs to be more specific. What byte order is used for multi-byte primitive types? I think this needs to cover each primitive type (other than int96) and provide a reference value for each.
It has a statement outside the enumeration that the hash function takes the plain encoding of the value as input. Do we still need to specify byte order? In Parquet, plain encoding uses the little-endian representation of the value.
Yes, because the byte order of the result is not obvious and not obviously the same as the input.
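For instance, under the "hash the plain encoding" rule, fixed-width primitives would be fed to the hash function as their little-endian plain-encoded bytes (Parquet's plain encoding is little-endian). A sketch of the encoding step only, since the hash itself is what the enum selects:

```python
import struct

def plain_encode_int64(v):
    # INT64 plain encoding: 8 bytes, little-endian, two's complement
    return struct.pack("<q", v)

def plain_encode_double(v):
    # DOUBLE plain encoding: 8 bytes, little-endian IEEE 754
    return struct.pack("<d", v)

assert plain_encode_int64(1) == b"\x01" + b"\x00" * 7
assert len(plain_encode_double(1.5)) == 8
```

The reviewer's point is that the byte order of the hash's *output* also needs to be stated, since it need not match the input's.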
    * Definition for hash function used to compute hash of column value.
    * Note that the hash function take plain encoding of column values as input.
    */
    union BloomFilterHash {
This isn't quite what I meant. Have a look at the way it is done in the logical types PR. https://github.com/apache/parquet-format/pull/51/files#diff-0f9d1b5347959e15259da7ba8f4b6252R264
That uses an empty struct in the union.
Do you mean in this way? It looks a bit strange and is hard to understand from the code.
src/main/thrift/parquet.thrift (outdated):

    /**
     * Hash strategy type annotation.
     */
    struct HashStrategy {
This should be called Murmur3, then the union includes a Murmur3 called MURMUR3. I know it's weird, but that's how we're doing it elsewhere.
This PR adds a bloom filter structure to Parquet:
- The bloom filter is built per column and stored at the end of each row group.
- The bloom filter size can be set by the user or calculated automatically.
The bloom filter structure contains a bloom filter header and a bitset. The header specifies the length of the bitset, the main algorithm, and the hash function applied.
Here's the related design doc.
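For the "calculated automatically" case, a common way to size a bloom filter (the standard formula, not necessarily this PR's exact code) derives the bit count from the expected number of distinct values n and the target false-positive probability p:

```python
import math

def optimal_bits(n, p):
    """Standard bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    return int(math.ceil(-n * math.log(p) / (math.log(2) ** 2)))

# roughly 9.6 million bits (~1.2 MB) for 1M distinct values at a 1% FPP
bits = optimal_bits(1_000_000, 0.01)
assert 9_585_000 < bits < 9_586_000
```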