Merged
8 changes: 4 additions & 4 deletions README.md
@@ -239,10 +239,10 @@ skip pages more efficiently. See [PageIndex.md](PageIndex.md) for details and
the reasoning behind adding these to the format.

## Checksumming
Data pages can be individually checksummed. This allows disabling of checksums at the
HDFS file level, to better support single row lookups. Data page checksums are calculated
using the standard CRC32 algorithm on the compressed data of a page (not including the
page header itself).
Pages of all kinds can be individually checksummed. This allows disabling of checksums
at the HDFS file level, to better support single row lookups. Checksums are calculated
using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary
representation of a page (not including the page header itself).
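
As a rough illustration of the rule above, the following sketch computes a page checksum over the serialized page bytes and checks it again on read. It is not taken from any Parquet implementation: the helper names `page_crc` and `verify_page` are hypothetical, and `zlib.compress` merely stands in for whatever codec a writer would actually apply. Because the `crc` field in the page header is a Thrift `i32`, which is signed, the unsigned CRC32 value is reinterpreted as a signed 32-bit integer before it is stored.

```python
import struct
import zlib


def page_crc(page_body: bytes) -> int:
    """CRC32 of the serialized page bytes, as a signed 32-bit integer.

    `page_body` is assumed to be exactly what follows the page header on
    disk, i.e. after any compression (and encryption) has been applied.
    """
    unsigned = zlib.crc32(page_body) & 0xFFFFFFFF
    # Thrift's i32 is signed, so reinterpret the same 32 bits as signed.
    return struct.unpack("<i", struct.pack("<I", unsigned))[0]


def verify_page(page_body: bytes, stored_crc: int) -> bool:
    """Return True if the stored header CRC matches the page bytes."""
    return page_crc(page_body) == stored_crc


# Hypothetical v1 data page: levels and values concatenated, then compressed.
levels_and_values = b"\x00\x01\x02" * 100
page_body = zlib.compress(levels_and_values)  # stand-in for the page codec

crc = page_crc(page_body)            # computed after compression
assert verify_page(page_body, crc)   # an intact page passes

corrupted = bytes([page_body[0] ^ 0x01]) + page_body[1:]
assert not verify_page(corrupted, crc)  # a flipped bit is detected
```

The point of the sketch is the ordering: the checksum covers the bytes as written, so a reader can verify a page before attempting to decompress it.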

## Error recovery
If the file metadata is corrupt, the file is lost. If the column metadata is corrupt,
39 changes: 15 additions & 24 deletions src/main/thrift/parquet.thrift
@@ -639,32 +639,23 @@ struct PageHeader {
/** Compressed (and potentially encrypted) page size in bytes, not including this header **/
3: required i32 compressed_page_size

/** The 32bit CRC for the page, to be calculated as follows:
* - Using the standard CRC32 algorithm
* - On the data only, i.e. this header should not be included. 'Data'
* hereby refers to the concatenation of the repetition levels, the
* definition levels and the column value, in this exact order.
* - On the encoded versions of the repetition levels, definition levels and
* column values
* - On the compressed versions of the repetition levels, definition levels
* and column values where possible;
* - For v1 data pages, the repetition levels, definition levels and column
* values are always compressed together. If a compression scheme is
* specified, the CRC shall be calculated on the compressed version of
* this concatenation. If no compression scheme is specified, the CRC
* shall be calculated on the uncompressed version of this concatenation.
* - For v2 data pages, the repetition levels and definition levels are
* handled separately from the data and are never compressed (only
* encoded). If a compression scheme is specified, the CRC shall be
* calculated on the concatenation of the uncompressed repetition levels,
* uncompressed definition levels and the compressed column values.
* If no compression scheme is specified, the CRC shall be calculated on
* the uncompressed concatenation.
* - In encrypted columns, CRC is calculated after page encryption; the
* encryption itself is performed after page compression (if compressed)
/** The 32-bit CRC checksum for the page, to be calculated as follows:
*
* - The standard CRC32 algorithm is used (with polynomial 0x04C11DB7,
* the same as in e.g. GZip).
* - All page types can have a CRC (v1 and v2 data pages, dictionary pages,
* etc.).
* - The CRC is computed on the serialized binary representation of the page
* (as written to disk), excluding the page header. For example, for v1
* data pages, the CRC is computed on the concatenation of repetition levels,
* definition levels and column values (optionally compressed, optionally
* encrypted).
* - The CRC computation therefore takes place after any compression
* and encryption steps, if any.
*
* If enabled, this allows for disabling checksumming in HDFS if only a few
* pages need to be read.
**/
*/
4: optional i32 crc

// Headers for page specific data. One only will be set.