Merged
4 changes: 3 additions & 1 deletion README.md
@@ -195,7 +195,9 @@ the reasoning behind adding these to the format.

## Checksumming
Data pages can be individually checksummed. This allows disabling of checksums at the
-HDFS file level, to better support single row lookups.
+HDFS file level, to better support single row lookups. Data page checksums are calculated
+using the standard CRC32 algorithm on the compressed data of a page (not including the
+page header itself).

## Error recovery
If the file metadata is corrupt, the file is lost. If the column metadata is corrupt,
25 changes: 23 additions & 2 deletions src/main/thrift/parquet.thrift
@@ -604,8 +604,29 @@ struct PageHeader {
/** Compressed page size in bytes (not including this header) **/
3: required i32 compressed_page_size

-/** 32bit crc for the data below. This allows for disabling checksumming in HDFS
-* if only a few pages needs to be read
+/** The 32bit CRC for the page, to be calculated as follows:
+* - Using the standard CRC32 algorithm
+* - On the data only, i.e. this header should not be included. 'Data'
+* hereby refers to the concatenation of the repetition levels, the
+* definition levels and the column value, in this exact order.
+* - On the encoded versions of the repetition levels, definition levels and
+* column values
+* - On the compressed versions of the repetition levels, definition levels
+* and column values where possible;
+* - For v1 data pages, the repetition levels, definition levels and column
+* values are always compressed together. If a compression scheme is
+* specified, the CRC shall be calculated on the compressed version of
+* this concatenation. If no compression scheme is specified, the CRC
+* shall be calculated on the uncompressed version of this concatenation.
+* - For v2 data pages, the repetition levels and definition levels are
+* handled separately from the data and are never compressed (only
+* encoded). If a compression scheme is specified, the CRC shall be
+* calculated on the concatenation of the uncompressed repetition levels,
+* uncompressed definition levels and the compressed column values.
+* If no compression scheme is specified, the CRC shall be calculated on
+* the uncompressed concatenation.
+* If enabled, this allows for disabling checksumming in HDFS if only a few
+* pages need to be read.
**/
4: optional i32 crc

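For illustration, the rules in the new `crc` comment can be sketched in Python. This is a minimal sketch, not the reference implementation. It assumes that "standard CRC32" means the CRC-32 variant exposed by `zlib.crc32`, that the level and value buffers passed in are already encoded, and that `compress` is a caller-supplied codec function. All function names here (`crc32_of_parts`, `v1_page_crc`, `v2_page_crc`, `as_thrift_i32`) are hypothetical and appear nowhere in this change.

```python
import zlib
from typing import Callable, Optional

Codec = Optional[Callable[[bytes], bytes]]


def crc32_of_parts(*parts: bytes) -> int:
    """CRC-32 of the concatenation of `parts`, computed incrementally."""
    crc = 0
    for part in parts:
        # zlib.crc32 accepts a running value, so no copy of the
        # concatenated buffer is needed.
        crc = zlib.crc32(part, crc)
    return crc  # unsigned, in range [0, 2**32)


def v1_page_crc(rep: bytes, def_: bytes, values: bytes, compress: Codec = None) -> int:
    """v1 data page: levels and values are compressed together."""
    body = rep + def_ + values
    if compress is not None:
        body = compress(body)
    # The page header is never part of the checksummed bytes.
    return crc32_of_parts(body)


def v2_page_crc(rep: bytes, def_: bytes, values: bytes, compress: Codec = None) -> int:
    """v2 data page: levels stay uncompressed, only the values are compressed."""
    if compress is not None:
        values = compress(values)
    return crc32_of_parts(rep, def_, values)


def as_thrift_i32(crc: int) -> int:
    """Reinterpret an unsigned 32-bit CRC as a signed 32-bit integer."""
    return crc - (1 << 32) if crc >= (1 << 31) else crc
```

Since `crc` is declared as a Thrift `i32`, which is signed, an unsigned CRC value with its top bit set cannot be stored directly. `as_thrift_i32` shows one way to reinterpret the same 32 bits as a signed value. Whether a given writer does exactly this is an assumption of the sketch.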