Closed
67 commits
9ae5a9e
PARQUET-41: Add bloom filter for parquet
Aug 12, 2017
64e68c0
change version back to 2.3.2-SNAPSHOT
Aug 28, 2017
b169724
PARQUET-41: Add more info about algorithm
Sep 14, 2017
d198752
PARQUET-41: Add more info for algorithm
Sep 14, 2017
528b09a
PARQUET-41: Refine the comment
Sep 23, 2017
886c605
PARQUET-41: Update tiny bloom filter endianness
Sep 24, 2017
b60cc1b
PARQUET-41: Change Enum to Union to support forward compatibility
Sep 26, 2017
179c2a2
PARQUET-41: Update comments
Sep 26, 2017
15c9d7d
PARQUET-41: Use empty struct annotation to replace enum
Sep 27, 2017
f913b35
PARQUET-41: update naming
Sep 28, 2017
499d597
PARQUET-1032: fix varint-encode() encoding algorithm link
kostya-sh Oct 6, 2017
523d7b6
PARQUET-1076: Use long key ids in KEYS file
Oct 6, 2017
bef5438
PARQUET-686: Clarifications about min-max stats.
Oct 6, 2017
b9443d9
PARQUET-1024: Allow case-insensitive parquet-xxx prefix in PR title.
rdblue Oct 6, 2017
e127c3f
PARQUET-1050 fix the comments mistake of struct DataPageHeaderV2
Oct 6, 2017
f59258a
PARQUET-322 Document ENUM as a logical type.
jkukul Oct 6, 2017
863875e
PARQUET-906: Add LogicalType annotation.
rdblue Oct 10, 2017
ddc18a7
PARQUET-1125: Add UUID logical type.
rdblue Oct 10, 2017
84460c5
PARQUET-1124: Add LZ4 and Zstd compression codecs.
rdblue Oct 10, 2017
3b04d86
PARQUET-1031: Fix spelling errors, whitespace, GitHub urls
Mistobaan Oct 11, 2017
65f1057
PARQUET-1136: Fix path to parquet.thrift in Makefile
Oct 12, 2017
f1de77d
PARQUET-922: Add column indexes to parquet.thrift
Oct 16, 2017
54cc08d
PARQUET-1134: Update CHANGES.md.
rdblue Oct 17, 2017
3fb6b39
[maven-release-plugin] prepare release apache-parquet-format-2.4.0
rdblue Oct 17, 2017
da4e39a
[maven-release-plugin] prepare for next development iteration
rdblue Oct 17, 2017
2fc7965
PARQUET-1144: Remove slf4j-nop.
rdblue Oct 17, 2017
08eb0ce
[maven-release-plugin] prepare release apache-parquet-format-2.4.0
rdblue Oct 17, 2017
2f57466
[maven-release-plugin] prepare for next development iteration
rdblue Oct 17, 2017
a00e770
PARQUET-1145: Add license to .gitignore
Nov 13, 2017
5e23dab
PARQUET-1156: Address dev/merge_parquet_pr.py problems.
Jan 9, 2018
c6d306d
PARQUET-1064: Deprecate type-defined sort ordering for INTERVAL type.
Jan 9, 2018
2696f9e
PARQUET-1171: Clarify scope of usage for RLE, BIT_PACKED encodings
wesm Jan 10, 2018
6e5b78d
PARQUET-1065: Deprecate type-defined sort ordering for INT96 type
Jan 11, 2018
9fef1d8
PARQUET-1197: Log rat failures
Jan 18, 2018
a64a331
PARQUET-1201: Implement page indexes
Feb 13, 2018
2667e08
PARQUET-323: Mark INT96 as deprecated
Mar 13, 2018
4d58831
PARQUET-1236: Align version of slf4j-api
1028332163 Mar 21, 2018
31a9ddc
Update Encodings.md with RLE_DICTIONARY
timarmstrong Mar 22, 2018
809edf0
Merge pull request #86 from lekv/p323
lekv Mar 22, 2018
92661a4
Merge pull request #89 from timarmstrong/master
lekv Mar 23, 2018
8c9851c
PARQUET-1242: parquet.thrift refers to wrong releases for the new com…
Mar 23, 2018
952c263
PARQUET-1251: Clarify ambiguous min/max stats for FLOAT/DOUBLE (#88)
gszadovszky Mar 26, 2018
2174041
PARQUET-1258: Update scm developer connection to github (#90)
gszadovszky Mar 28, 2018
d9ee1b9
PARQUET-1260: Add Zoltan Ivanfi's code signing key to the KEYS file (…
zivanfi Mar 29, 2018
af854cf
PARQUET-1234: Update CHANGES.md.
Mar 26, 2018
a5b8426
[maven-release-plugin] prepare release apache-parquet-format-2.5.0
Mar 29, 2018
ea4ac56
Revert "[maven-release-plugin] prepare release apache-parquet-format-…
Mar 29, 2018
f0fa7c1
[maven-release-plugin] prepare release apache-parquet-format-2.5.0
Mar 29, 2018
a7e6b28
[maven-release-plugin] prepare for next development iteration
Mar 29, 2018
709e25e
PARQUET-1290: clarify run lengths for RLE encoding (#96)
timarmstrong May 7, 2018
0fdd35a
PARQUET-1294: Update release scripts for the new Apache policy
May 10, 2018
3e6cd14
PARQUET-1266: LogicalTypes union in parquet-format doesn't include UUID
nkollar Apr 5, 2018
2c17e6d
PARQUET-41: add bloom filter
Jun 3, 2018
330f470
PARQUET-41: Add bloom filter for parquet
Aug 12, 2017
b013bc7
change version back to 2.3.2-SNAPSHOT
Aug 28, 2017
84e1488
PARQUET-41: Add more info about algorithm
Sep 14, 2017
d11ac72
PARQUET-41: Add more info for algorithm
Sep 14, 2017
9a38b9c
PARQUET-41: Refine the comment
Sep 23, 2017
e08db11
PARQUET-41: Update tiny bloom filter endianness
Sep 24, 2017
0aa266a
PARQUET-41: Change Enum to Union to support forward compatibility
Sep 26, 2017
66c5c59
PARQUET-41: Update comments
Sep 26, 2017
626149a
PARQUET-41: Use empty struct annotation to replace enum
Sep 27, 2017
05f8599
PARQUET-41: update naming
Sep 28, 2017
ec24e93
PARQUET-41: rebase to master
Jun 19, 2018
984455d
Merge branch 'PARQUET-41' of https://github.com/cjjnjust/parquet-form…
Jun 19, 2018
e20a2d2
Merge branch 'PARQUET-41' of https://github.com/cjjnjust/parquet-form…
Jun 19, 2018
5fc9400
Merge branch 'PARQUET-41' of https://github.com/cjjnjust/parquet-form…
Jun 19, 2018
17 changes: 17 additions & 0 deletions .gitignore
@@ -1,3 +1,20 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

generated/*
target
dependency-reduced-pom.xml
17 changes: 17 additions & 0 deletions .travis.yml
@@ -1,3 +1,20 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

language: java
dist: precise
before_install:
64 changes: 64 additions & 0 deletions CHANGES.md
@@ -19,6 +19,70 @@

# Parquet #

### Version 2.5.0 ###

#### Bug

* [PARQUET-323](https://issues.apache.org/jira/browse/PARQUET-323) - INT96 should be marked as deprecated
* [PARQUET-1064](https://issues.apache.org/jira/browse/PARQUET-1064) - Deprecate type-defined sort ordering for INTERVAL type
* [PARQUET-1065](https://issues.apache.org/jira/browse/PARQUET-1065) - Deprecate type-defined sort ordering for INT96 type
* [PARQUET-1145](https://issues.apache.org/jira/browse/PARQUET-1145) - Add license to .gitignore and .travis.yml
* [PARQUET-1156](https://issues.apache.org/jira/browse/PARQUET-1156) - dev/merge\_parquet\_pr.py problems
* [PARQUET-1236](https://issues.apache.org/jira/browse/PARQUET-1236) - Upgrade org.slf4j:slf4j-api:1.7.2 to 1.7.12
* [PARQUET-1242](https://issues.apache.org/jira/browse/PARQUET-1242) - parquet.thrift refers to wrong releases for the new compressions
* [PARQUET-1251](https://issues.apache.org/jira/browse/PARQUET-1251) - Clarify ambiguous min/max stats for FLOAT/DOUBLE
* [PARQUET-1258](https://issues.apache.org/jira/browse/PARQUET-1258) - Update scm developer connection to github

#### New Feature

* [PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201) - Write column indexes

#### Improvement

* [PARQUET-1171](https://issues.apache.org/jira/browse/PARQUET-1171) - \[C++\] Clarify valid uses for RLE, BIT_PACKED encodings
* [PARQUET-1197](https://issues.apache.org/jira/browse/PARQUET-1197) - Log rat failures

#### Task

* [PARQUET-1234](https://issues.apache.org/jira/browse/PARQUET-1234) - Release Parquet format 2.5.0

### Version 2.4.0 ###

#### Bug

* [PARQUET-255](https://issues.apache.org/jira/browse/PARQUET-255) - Typo in decimal type specification
* [PARQUET-322](https://issues.apache.org/jira/browse/PARQUET-322) - Document ENUM as a logical type
* [PARQUET-412](https://issues.apache.org/jira/browse/PARQUET-412) - Format: Do not shade slf4j-api
* [PARQUET-419](https://issues.apache.org/jira/browse/PARQUET-419) - Update dev script in parquet-cpp to remove incubator.
* [PARQUET-655](https://issues.apache.org/jira/browse/PARQUET-655) - The LogicalTypes.md link in README.md points to the old Parquet GitHub repository
* [PARQUET-1031](https://issues.apache.org/jira/browse/PARQUET-1031) - Fix spelling errors, whitespace, GitHub urls
* [PARQUET-1032](https://issues.apache.org/jira/browse/PARQUET-1032) - Change link in Encodings.md for variable length encoding
* [PARQUET-1050](https://issues.apache.org/jira/browse/PARQUET-1050) - The comment of Parquet Format Thrift definition file error
* [PARQUET-1076](https://issues.apache.org/jira/browse/PARQUET-1076) - [Format] Switch to long key ids in KEYs file
* [PARQUET-1091](https://issues.apache.org/jira/browse/PARQUET-1091) - Wrong and broken links in README
* [PARQUET-1102](https://issues.apache.org/jira/browse/PARQUET-1102) - Travis CI builds are failing for parquet-format PRs
* [PARQUET-1134](https://issues.apache.org/jira/browse/PARQUET-1134) - Release Parquet format 2.4.0
* [PARQUET-1136](https://issues.apache.org/jira/browse/PARQUET-1136) - Makefile is broken

#### Improvement

* [PARQUET-371](https://issues.apache.org/jira/browse/PARQUET-371) - Bumps Thrift version to 0.9.3
* [PARQUET-407](https://issues.apache.org/jira/browse/PARQUET-407) - Incorrect delta-encoding example
* [PARQUET-428](https://issues.apache.org/jira/browse/PARQUET-428) - Support INT96 and FIXED_LEN_BYTE_ARRAY types
* [PARQUET-601](https://issues.apache.org/jira/browse/PARQUET-601) - Add support in Parquet to configure the encoding used by ValueWriters
* [PARQUET-609](https://issues.apache.org/jira/browse/PARQUET-609) - Add Brotli compression to Parquet format
* [PARQUET-757](https://issues.apache.org/jira/browse/PARQUET-757) - Add NULL type to Bring Parquet logical types to par with Arrow
* [PARQUET-804](https://issues.apache.org/jira/browse/PARQUET-804) - parquet-format README.md still links to the old Google group
* [PARQUET-922](https://issues.apache.org/jira/browse/PARQUET-922) - Add index pages to the format to support efficient page skipping
* [PARQUET-1049](https://issues.apache.org/jira/browse/PARQUET-1049) - Make thrift version a property in pom.xml

#### Task

* [PARQUET-450](https://issues.apache.org/jira/browse/PARQUET-450) - Small typos/issues in parquet-format documentation
* [PARQUET-667](https://issues.apache.org/jira/browse/PARQUET-667) - Update committers lists to point to apache website
* [PARQUET-1124](https://issues.apache.org/jira/browse/PARQUET-1124) - Add new compression codecs to the Parquet spec
* [PARQUET-1125](https://issues.apache.org/jira/browse/PARQUET-1125) - Add UUID logical type

### Version 2.2.0 ###

* [PARQUET-23](https://issues.apache.org/jira/browse/PARQUET-23): Rename packages and maven coordinates to org.apache
74 changes: 49 additions & 25 deletions Encodings.md
@@ -27,53 +27,59 @@ This file contains the specification of all supported encodings.
Supported Types: all

This is the plain encoding that must be supported for types. It is
intended to be the simplest encoding. Values are encoded back to back.

The plain encoding is used whenever a more efficient encoding can not be used. It
stores the data in the following format:
- BOOLEAN: [Bit Packed](#RLE), LSB first
- INT32: 4 bytes little endian
- INT64: 8 bytes little endian
- INT96: 12 bytes little endian
- INT96: 12 bytes little endian (deprecated)
- FLOAT: 4 bytes IEEE little endian
- DOUBLE: 8 bytes IEEE little endian
- BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
- FIXED_LEN_BYTE_ARRAY: the bytes contained in the array

For native types, this outputs the data as little endian. Floating
point types are encoded in IEEE.

For the byte array type, it encodes the length as a 4 byte little
endian, followed by the bytes.
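
The PLAIN layouts listed above can be illustrated with a minimal Python sketch. The function names are hypothetical and not part of any Parquet library, and treating the BYTE_ARRAY length prefix as unsigned is an assumption (the text only says "4 bytes little endian").

```python
import struct

def plain_encode_int32(values):
    """PLAIN INT32: each value as 4 bytes, little endian, back to back."""
    return b"".join(struct.pack("<i", v) for v in values)

def plain_encode_double(values):
    """PLAIN DOUBLE: each value as 8 bytes, IEEE 754, little endian."""
    return b"".join(struct.pack("<d", v) for v in values)

def plain_encode_byte_array(values):
    """PLAIN BYTE_ARRAY: 4-byte little-endian length, then the raw bytes.

    Assumption: the length is packed as an unsigned 32-bit integer.
    """
    return b"".join(struct.pack("<I", len(v)) + v for v in values)

def plain_encode_boolean(values):
    """PLAIN BOOLEAN: one bit per value, least significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)
```

For instance, `plain_encode_byte_array([b"ab"])` yields the four length bytes `02 00 00 00` followed by `ab`.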

### Dictionary Encoding (PLAIN_DICTIONARY = 2)
The dictionary encoding builds a dictionary of values encountered in a given column. The
### Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
The dictionary encoding builds a dictionary of values encountered in a given column. The
dictionary will be stored in a dictionary page per column chunk. The values are stored as integers
using the [RLE/Bit-Packing Hybrid](#RLE) encoding. If the dictionary grows too big, whether in size
or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is
written first, before the data pages of the column chunk.

Dictionary page format: the entries in the dictionary - in dictionary order - using the [plain](#PLAIN) encoding.

Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32),
followed by the values encoded using RLE/Bit packed described above (with the given bit width).

Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY
in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.
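
As a rough illustration of the dictionary step described above, the following Python sketch builds a dictionary in first-occurrence order and derives the entry ids and the bit width stored in the data page. The helper names are hypothetical, the size-based fallback to plain encoding is omitted, and choosing the smallest bit width that fits the largest id is an assumption about what a writer might do.

```python
def build_dictionary(values):
    """Return (dictionary, indices) with entries in first-occurrence order."""
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

def index_bit_width(dictionary_size):
    """Smallest bit width that can represent the largest entry id.

    Assumption: a writer picks the minimal width; the format only
    requires that the width fit in 1 byte (max bit width = 32).
    """
    return max(dictionary_size - 1, 0).bit_length()
```

The dictionary entries would then be written with the plain encoding in the dictionary page, and the indices with the RLE/bit-packing hybrid in the data pages.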

### <a name="RLE"></a>Run Length Encoding / Bit-Packing Hybrid (RLE = 3)

This encoding uses a combination of bit-packing and run length encoding to more efficiently store repeated values.

The grammar for this encoding looks like this, given a fixed bit-width known in advance:
```
rle-bit-packed-hybrid: <length> <encoded-data>
length := length of the <encoded-data> in bytes stored as 4 bytes little endian
length := length of the <encoded-data> in bytes stored as 4 bytes little endian (unsigned int32)
encoded-data := <run>*
run := <bit-packed-run> | <rle-run>
bit-packed-run := <bit-packed-header> <bit-packed-values>
bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1)
// we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8
bit-pack-count := (number of values in this run) / 8
bit-packed-values := *see 1 below*
rle-run := <rle-header> <repeated-value>
rle-header := varint-encode( (number of times repeated) << 1)
run := <bit-packed-run> | <rle-run>
bit-packed-run := <bit-packed-header> <bit-packed-values>
bit-packed-header := varint-encode(<bit-pack-scaled-run-len> << 1 | 1)
// we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8
bit-pack-scaled-run-len := (bit-packed-run-len) / 8
bit-packed-run-len := *see 3 below*
bit-packed-values := *see 1 below*
rle-run := <rle-header> <repeated-value>
rle-header := varint-encode( (rle-run-len) << 1)
rle-run-len := *see 3 below*
repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width)
```

@@ -82,14 +88,14 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex
though the order of the bits in each value remains in the usual order of most significant to least
significant. For example, to pack the same values as the example in the deprecated encoding above:

The numbers 1 through 7 using bit width 3:
```
dec value: 0 1 2 3 4 5 6 7
bit value: 000 001 010 011 100 101 110 111
bit label: ABC DEF GHI JKL MNO PQR STU VWX
```

would be encoded like this where spaces mark byte boundaries (3 bytes):
```
bit value: 10001000 11000110 11111010
bit label: HIDEFABC RMNOJKLG VWXSTUPQ
@@ -101,9 +107,24 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex
shifting and ORing with a mask. (to make this optimization work on a big-endian machine,
you would have to use the ordering used in the [deprecated bit-packing](#BITPACKED) encoding)

2. varint-encode() is ULEB-128 encoding, see http://en.wikipedia.org/wiki/Variable-length_quantity
2. varint-encode() is ULEB-128 encoding, see https://en.wikipedia.org/wiki/LEB128

3. bit-packed-run-len and rle-run-len must be in the range \[1, 2<sup>31</sup> - 1\].
This means that a Parquet implementation can always store the run length in a signed
32-bit integer. This length restriction was not part of the Parquet 2.5.0 and earlier
specifications, but longer runs were not readable by the most common Parquet
implementations so, in practice, were not safe for Parquet writers to emit.
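
Notes 1 and 2 can be checked with a short Python sketch. The function names are illustrative only; the bit-packing helper relies on Python's arbitrary-precision integers rather than the shift-and-mask loop a real implementation would use.

```python
def varint_encode(n):
    """ULEB-128: 7 bits per byte, low-order group first, high bit = 'more follows'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def bit_pack_lsb_first(values, bit_width):
    """Pack values from the least significant bit of each byte to the most."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8
    return acc.to_bytes(n_bytes, "little")
```

Packing the values 0 through 7 with bit width 3 gives the bytes `10001000 11000110 11111010`, matching the example in note 1.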


Note that the RLE encoding method is only supported for the following types of
data:

* Repetition and definition levels
* Dictionary indices
* Boolean values in data pages, as an alternative to PLAIN encoding
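
The grammar for the hybrid encoding can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reference writer: the function names are hypothetical, the repeated value is assumed to be written little endian, and the caller is assumed to have already split the input into runs.

```python
import struct

MAX_RUN_LEN = 2**31 - 1  # upper bound on run lengths from note 3

def varint_encode(n):
    """ULEB-128 encoding of a non-negative integer."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def rle_run(value, run_len, bit_width):
    """<rle-header> <repeated-value>; header low bit is 0."""
    if not 1 <= run_len <= MAX_RUN_LEN:
        raise ValueError("rle-run-len out of range")
    header = varint_encode(run_len << 1)
    value_bytes = (bit_width + 7) // 8  # round-up-to-next-byte(bit-width)
    # Assumption: the repeated value is stored little endian.
    return header + value.to_bytes(value_bytes, "little")

def bit_packed_run(values, bit_width):
    """<bit-packed-header> <bit-packed-values>; header low bit is 1."""
    if len(values) == 0 or len(values) % 8 != 0:
        raise ValueError("bit-packed runs hold a non-zero multiple of 8 values")
    header = varint_encode((len(values) // 8) << 1 | 1)
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * bit_width)  # LSB-first packing, see note 1
    return header + acc.to_bytes(len(values) * bit_width // 8, "little")

def rle_bit_packed_hybrid(runs):
    """<length> <encoded-data>, with length as 4 bytes little endian."""
    data = b"".join(runs)
    return struct.pack("<I", len(data)) + data
```

For example, `rle_run(4, 100, 3)` produces the two header bytes for `100 << 1` followed by the single byte `04`.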

### <a name="BITPACKED"></a>Bit-packed (Deprecated) (BIT_PACKED = 4)

This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding.
Each value is encoded back to back using a fixed width.
There is no padding between values (except for the last byte) which is padded with 0s.
@@ -114,18 +135,21 @@ This implementation is deprecated because the [RLE/bit-packing](#RLE) hybrid is
For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit,
which is not the same as the [RLE/bit-packing](#RLE) hybrid.

For example, the numbers 1 through 7 using bit width 3:
```
dec value: 0 1 2 3 4 5 6 7
bit value: 000 001 010 011 100 101 110 111
bit label: ABC DEF GHI JKL MNO PQR STU VWX
```
would be encoded like this where spaces mark byte boundaries (3 bytes):
```
bit value: 00000101 00111001 01110111
bit label: ABCDEFGH IJKLMNOP QRSTUVWX
```

Note that the BIT_PACKED encoding method is only supported for encoding
repetition and definition levels.
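
For comparison with the hybrid encoding, the deprecated most-significant-bit-first packing can be sketched in Python like this. The function name is illustrative; the sketch pads the final byte with zero bits as described above.

```python
def bit_pack_msb_first(values, bit_width):
    """Pack values back to back, most significant bit first (BIT_PACKED)."""
    acc = 0
    total_bits = 0
    for v in values:
        acc = (acc << bit_width) | v
        total_bits += bit_width
    padding = -total_bits % 8  # zero bits needed to fill the last byte
    acc <<= padding
    return acc.to_bytes((total_bits + padding) // 8, "big")
```

Packing the values 0 through 7 with bit width 3 gives `00000101 00111001 01110111`, the byte sequence shown in the example.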

### <a name="DELTAENC"></a>Delta Encoding (DELTA_BINARY_PACKED = 5)
Supported Types: INT32, INT64

@@ -141,7 +165,7 @@ The header is defined as follows:
* the total value count is stored as a VLQ int
* the first value is stored as a zigzag VLQ int
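
The two header fields shown here can be illustrated with a small Python sketch. It assumes that "VLQ int" uses the same ULEB-128 byte layout as varint-encode() in the RLE section and that values fit in a signed 64-bit integer; the function names are hypothetical.

```python
def vlq_encode(n):
    """Unsigned variable-length integer (assumed ULEB-128 layout)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def zigzag_encode(n):
    """Map signed 64-bit integers to unsigned ones: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF

def zigzag_vlq_encode(n):
    """Zigzag first, so small negative values also get short encodings."""
    return vlq_encode(zigzag_encode(n))
```

Zigzag keeps values of small magnitude small regardless of sign, which is why it is applied to the first value before the variable-length encoding.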

Each block contains
```
<min delta> <list of bitwidths of miniblocks> <miniblocks>
```
@@ -230,7 +254,7 @@ Supported Types: BYTE_ARRAY
This is also known as incremental encoding or front compression: for each element in a
sequence of strings, store the prefix length of the previous entry plus the suffix.

For a longer description, see http://en.wikipedia.org/wiki/Incremental_encoding.
For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding.

This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by
the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
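
The prefix/suffix split can be sketched in Python as follows. The function name is hypothetical, and the sketch stops at the split: the prefix lengths would still need to go through DELTA_BINARY_PACKED and the suffixes through DELTA_LENGTH_BYTE_ARRAY.

```python
def incremental_split(strings):
    """Return (prefix_lengths, suffixes) for a sequence of byte strings."""
    prefix_lengths = []
    suffixes = []
    previous = b""
    for s in strings:
        # Length of the prefix shared with the previous entry.
        n = 0
        limit = min(len(previous), len(s))
        while n < limit and previous[n] == s[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(s[n:])
        previous = s
    return prefix_lengths, suffixes
```

For the input `axis`, `axle`, `babble`, `babyhood`, the prefix lengths are 0, 2, 0, 3 and the suffixes are `axis`, `le`, `babble`, `yhood`.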