Closed
67 commits
9ae5a9e
PARQUET-41: Add bloom filter for parquet
Aug 12, 2017
64e68c0
change version back to 2.3.2-SNAPSHOT
Aug 28, 2017
b169724
PARQUET-41: Add more info about algorithm
Sep 14, 2017
d198752
PARQUET-41: Add more info for algorithm
Sep 14, 2017
528b09a
PARQUET-41: Refine the comment
Sep 23, 2017
886c605
PARQUET-41: Update tiny bloom filter endianness
Sep 24, 2017
b60cc1b
PARQUET-41: Change Enum to Union to support forward compatibility
Sep 26, 2017
179c2a2
PARQUET-41: Update comments
Sep 26, 2017
15c9d7d
PARQUET-41: Use empty struct annotation to replace enum
Sep 27, 2017
f913b35
PARQUET-41: update naming
Sep 28, 2017
499d597
PARQUET-1032: fix varint-encode() encoding algorithm link
kostya-sh Oct 6, 2017
523d7b6
PARQUET-1076: Use long key ids in KEYS file
Oct 6, 2017
bef5438
PARQUET-686: Clarifications about min-max stats.
Oct 6, 2017
b9443d9
PARQUET-1024: Allow case-insensitive parquet-xxx prefix in PR title.
rdblue Oct 6, 2017
e127c3f
PARQUET-1050 fix the comments mistake of struct DataPageHeaderV2
Oct 6, 2017
f59258a
PARQUET-322 Document ENUM as a logical type.
jkukul Oct 6, 2017
863875e
PARQUET-906: Add LogicalType annotation.
rdblue Oct 10, 2017
ddc18a7
PARQUET-1125: Add UUID logical type.
rdblue Oct 10, 2017
84460c5
PARQUET-1124: Add LZ4 and Zstd compression codecs.
rdblue Oct 10, 2017
3b04d86
PARQUET-1031: Fix spelling errors, whitespace, GitHub urls
Mistobaan Oct 11, 2017
65f1057
PARQUET-1136: Fix path to parquet.thrift in Makefile
Oct 12, 2017
f1de77d
PARQUET-922: Add column indexes to parquet.thrift
Oct 16, 2017
54cc08d
PARQUET-1134: Update CHANGES.md.
rdblue Oct 17, 2017
3fb6b39
[maven-release-plugin] prepare release apache-parquet-format-2.4.0
rdblue Oct 17, 2017
da4e39a
[maven-release-plugin] prepare for next development iteration
rdblue Oct 17, 2017
2fc7965
PARQUET-1144: Remove slf4j-nop.
rdblue Oct 17, 2017
08eb0ce
[maven-release-plugin] prepare release apache-parquet-format-2.4.0
rdblue Oct 17, 2017
2f57466
[maven-release-plugin] prepare for next development iteration
rdblue Oct 17, 2017
a00e770
PARQUET-1145: Add license to .gitignore
Nov 13, 2017
5e23dab
PARQUET-1156: Address dev/merge_parquet_pr.py problems.
Jan 9, 2018
c6d306d
PARQUET-1064: Deprecate type-defined sort ordering for INTERVAL type.
Jan 9, 2018
2696f9e
PARQUET-1171: Clarify scope of usage for RLE, BIT_PACKED encodings
wesm Jan 10, 2018
6e5b78d
PARQUET-1065: Deprecate type-defined sort ordering for INT96 type
Jan 11, 2018
9fef1d8
PARQUET-1197: Log rat failures
Jan 18, 2018
a64a331
PARQUET-1201: Implement page indexes
Feb 13, 2018
2667e08
PARQUET-323: Mark INT96 as deprecated
Mar 13, 2018
4d58831
PARQUET-1236: Align version of slf4j-api
1028332163 Mar 21, 2018
31a9ddc
Update Encodings.md with RLE_DICTIONARY
timarmstrong Mar 22, 2018
809edf0
Merge pull request #86 from lekv/p323
lekv Mar 22, 2018
92661a4
Merge pull request #89 from timarmstrong/master
lekv Mar 23, 2018
8c9851c
PARQUET-1242: parquet.thrift refers to wrong releases for the new com…
Mar 23, 2018
952c263
PARQUET-1251: Clarify ambiguous min/max stats for FLOAT/DOUBLE (#88)
gszadovszky Mar 26, 2018
2174041
PARQUET-1258: Update scm developer connection to github (#90)
gszadovszky Mar 28, 2018
d9ee1b9
PARQUET-1260: Add Zoltan Ivanfi's code signing key to the KEYS file (…
zivanfi Mar 29, 2018
af854cf
PARQUET-1234: Update CHANGES.md.
Mar 26, 2018
a5b8426
[maven-release-plugin] prepare release apache-parquet-format-2.5.0
Mar 29, 2018
ea4ac56
Revert "[maven-release-plugin] prepare release apache-parquet-format-…
Mar 29, 2018
f0fa7c1
[maven-release-plugin] prepare release apache-parquet-format-2.5.0
Mar 29, 2018
a7e6b28
[maven-release-plugin] prepare for next development iteration
Mar 29, 2018
709e25e
PARQUET-1290: clarify run lengths for RLE encoding (#96)
timarmstrong May 7, 2018
0fdd35a
PARQUET-1294: Update release scripts for the new Apache policy
May 10, 2018
3e6cd14
PARQUET-1266: LogicalTypes union in parquet-format doesn't include UUID
nkollar Apr 5, 2018
2c17e6d
PARQUET-41: add bloom filter
Jun 3, 2018
330f470
PARQUET-41: Add bloom filter for parquet
Aug 12, 2017
b013bc7
change version back to 2.3.2-SNAPSHOT
Aug 28, 2017
84e1488
PARQUET-41: Add more info about algorithm
Sep 14, 2017
d11ac72
PARQUET-41: Add more info for algorithm
Sep 14, 2017
9a38b9c
PARQUET-41: Refine the comment
Sep 23, 2017
e08db11
PARQUET-41: Update tiny bloom filter endianness
Sep 24, 2017
0aa266a
PARQUET-41: Change Enum to Union to support forward compatibility
Sep 26, 2017
66c5c59
PARQUET-41: Update comments
Sep 26, 2017
626149a
PARQUET-41: Use empty struct annotation to replace enum
Sep 27, 2017
05f8599
PARQUET-41: update naming
Sep 28, 2017
ec24e93
PARQUET-41: rebase to master
Jun 19, 2018
984455d
Merge branch 'PARQUET-41' of https://github.com/cjjnjust/parquet-form…
Jun 19, 2018
e20a2d2
Merge branch 'PARQUET-41' of https://github.com/cjjnjust/parquet-form…
Jun 19, 2018
5fc9400
Merge branch 'PARQUET-41' of https://github.com/cjjnjust/parquet-form…
Jun 19, 2018
27 changes: 27 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -511,6 +511,33 @@ struct ColumnMetaData {
* This information can be used to determine if all data pages are
* dictionary encoded for example **/
13: optional list<PageEncodingStats> encoding_stats;

/** Byte offset from beginning of file to bloom filter data**/
Contributor

This should specify where the bloom filter data should be located for each column. I think the conclusion we came to was to put bloom filters for all columns together before the start of the row group the filters describe. That way we can avoid a seek if a bloom filter and row group both need to be read.
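
A rough sketch of the layout suggested here, assuming each column's filter is written as one contiguous run of bytes (header plus bitset) and that the row group data starts immediately after the last filter. `plan_layout` and the byte sizes are hypothetical and only illustrate the idea; they are not part of this patch.

```python
def plan_layout(start, filter_sizes):
    """Place one bloom filter per column back to back, then the row group.

    start: file position where the first filter is written (assumed).
    filter_sizes: bytes taken by each column's filter, header plus bitset.
    Returns (bloom_filter_offsets, row_group_start).
    """
    offsets = []
    pos = start
    for size in filter_sizes:
        offsets.append(pos)  # value to store in this column's bloom_filter_offset
        pos += size
    return offsets, pos  # row group data begins right after the filters


offsets, row_group_start = plan_layout(4, [48, 80, 48])
```

With this arrangement a reader that needs both the filters and the row group can issue one contiguous read starting at `offsets[0]`.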

14: optional i64 bloom_filter_offset;
}

enum BloomFilterAlgorithm {

This file does not contain enough information to fully reconstruct the algorithm.

Contributor Author

Hi Jim,
Once we get the bloom filter offset from the column chunk metadata, we can read the header defined here to parse the HASH and ALGORITHM, and then read the BYTES of the bitset. What other information do you think is missing here?

Imagine that your design document was lost and the parquet-mr patch was lost. Shouldn't this patch have enough information to support a complete reconstruction of the algorithm, as Ryan Blue mentioned in the call today?

This contains no information about the salt, about WHICH block-based bloom filter is used, the size of the blocks, the encoding of the data that is hashed, and so on.
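
To illustrate how many details "block based" leaves open, here is a minimal sketch of one such design: 32-byte blocks of eight 32-bit words, with one bit set per word, chosen by eight odd multiplicative salts. The block size, the salt constants, and the way the 64-bit hash is split between block selection and bit selection are all assumptions made for illustration; nothing in this patch confirms them.

```python
# Illustrative odd 32-bit multipliers, one per word in a block (assumed).
SALT = (0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
        0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31)
WORDS_PER_BLOCK = 8   # assumed: eight 32-bit words per block
BYTES_PER_BLOCK = 32  # assumed: 256-bit blocks


def block_mask(key32):
    """Return eight single-bit masks, one for each word of a block."""
    return [1 << (((key32 * salt) & 0xFFFFFFFF) >> 27) for salt in SALT]


class BlockBloomFilter:
    def __init__(self, num_bytes):
        # The header comment requires a power of 2 larger than 32.
        if num_bytes <= 32 or num_bytes & (num_bytes - 1):
            raise ValueError("num_bytes must be a power of 2 larger than 32")
        self.num_blocks = num_bytes // BYTES_PER_BLOCK
        self.words = [0] * (self.num_blocks * WORDS_PER_BLOCK)

    def _locate(self, hash64):
        # Assumed split: upper 32 bits pick the block, lower 32 bits pick the bits.
        block = (hash64 >> 32) & (self.num_blocks - 1)
        return block * WORDS_PER_BLOCK, block_mask(hash64 & 0xFFFFFFFF)

    def insert(self, hash64):
        base, mask = self._locate(hash64)
        for i, bit in enumerate(mask):
            self.words[base + i] |= bit

    def might_contain(self, hash64):
        base, mask = self._locate(hash64)
        return all(self.words[base + i] & bit for i, bit in enumerate(mask))
```

Changing any one of these choices produces a filter that another reader cannot interpret, which is why each of them needs to be written down in the format itself.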

Contributor Author

Thanks Jim, I added more info about this.

/** Block based bloom filter. **/
BLOCK = 0;
}

enum BloomFilterHash {
Contributor

Unfortunately, enums aren't forward-compatible in thrift, so we can't use them if we intend to add more entries later. That's why logical types are changing to use a union of structs instead: https://github.com/apache/parquet-format/pull/51/files#diff-0f9d1b5347959e15259da7ba8f4b6252R278.

Can you change these to use a union as well?

/** The hash function used to compute hash of column value,
* murmur3 has 32 bits and 128 bits hash, we use lower 64 bits from
* murmur3 128 bits function murmur3hash_x64_128
**/
MURMUR3 = 0;
Contributor

rdblue, Sep 25, 2017

This needs to be more specific. What byte order is used for multi-byte primitive types? I think this needs to cover each primitive type (other than int96) and provide a reference value for each.

Contributor Author

There is a statement outside the enumeration saying that the hash function takes the plain encoding of the value as input. Do we still need to specify the byte order? In Parquet, plain encoding stores values in little-endian order.

Yes, because the byte order of the result is not obvious and not obviously the same as the input.
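
A small sketch of why this matters. It uses Python's `struct` for the little-endian plain encoding and `hashlib.md5` purely as a stand-in 128-bit hash (it is not murmur3, which the standard library does not provide). The helper names are hypothetical. The same 16 output bytes give different "lower 64 bits" depending on the byte order used to read them.

```python
import hashlib
import struct


def plain_int32(value):
    """Plain encoding of an INT32: 4 bytes, little-endian."""
    return struct.pack("<i", value)


def plain_int64(value):
    """Plain encoding of an INT64: 8 bytes, little-endian."""
    return struct.pack("<q", value)


def plain_double(value):
    """Plain encoding of a DOUBLE: 8 bytes, IEEE 754, little-endian."""
    return struct.pack("<d", value)


def lower64(digest16, byteorder):
    """Read a 16-byte hash result as one integer and keep its low 64 bits."""
    return int.from_bytes(digest16, byteorder) & 0xFFFFFFFFFFFFFFFF


# Stand-in 128-bit hash of the plain-encoded value 1 (md5 is an assumption).
digest = hashlib.md5(plain_int32(1)).digest()
as_little = lower64(digest, "little")  # comes from the first 8 output bytes
as_big = lower64(digest, "big")        # comes from the last 8 output bytes
```

Fixing the input encoding therefore does not fix the result: the spec also has to say which 8 of the 16 output bytes are used and in what order, ideally with a reference value per primitive type.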

}

struct BloomFilterHeader {
Contributor

Where is this header written? I'm assuming that when you seek to a bloom filter's offset, you should be able to read the header? Then the filter bytes follow the header? I think this should be more clear about where structures are placed in the file.

/** The size of bitset in bytes, must be a power of 2 and larger than 32**/
1: required i32 numBytes;

/** The algorithm for setting bits. **/
2: required BloomFilterAlgorithm bloomFilterAlgorithm;

I think the names (but not the type names) of these fields can be simplified by removing "bloomFilter" from the front.


/** The hash function used for bloom filter. **/
3: required BloomFilterHash bloomFilterHash;
}

struct ColumnChunk {