diff --git a/README.md b/README.md
index c759be956..01193ae2b 100644
--- a/README.md
+++ b/README.md
@@ -195,7 +195,9 @@ the reasoning behind adding these to the format.
 
 ## Checksumming
 Data pages can be individually checksummed. This allows disabling of checksums at the
-HDFS file level, to better support single row lookups.
+HDFS file level, to better support single row lookups. Data page checksums are calculated
+using the standard CRC32 algorithm on the compressed data of a page (not including the
+page header itself).
 
 ## Error recovery
 If the file metadata is corrupt, the file is lost. If the column metadata is corrupt,
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 7a29b80f4..4272cc398 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -604,8 +604,29 @@ struct PageHeader {
   /** Compressed page size in bytes (not including this header) **/
   3: required i32 compressed_page_size
 
-  /** 32bit crc for the data below. This allows for disabling checksumming in HDFS
-   *  if only a few pages needs to be read
+  /** The 32bit CRC for the page, to be calculated as follows:
+   * - Using the standard CRC32 algorithm
+   * - On the data only, i.e. this header should not be included. 'Data'
+   *   hereby refers to the concatenation of the repetition levels, the
+   *   definition levels and the column values, in this exact order.
+   * - On the encoded versions of the repetition levels, definition levels and
+   *   column values
+   * - On the compressed versions of the repetition levels, definition levels
+   *   and column values where possible;
+   *   - For v1 data pages, the repetition levels, definition levels and column
+   *     values are always compressed together. If a compression scheme is
+   *     specified, the CRC shall be calculated on the compressed version of
+   *     this concatenation. If no compression scheme is specified, the CRC
+   *     shall be calculated on the uncompressed version of this concatenation.
+   *   - For v2 data pages, the repetition levels and definition levels are
+   *     handled separately from the data and are never compressed (only
+   *     encoded). If a compression scheme is specified, the CRC shall be
+   *     calculated on the concatenation of the uncompressed repetition levels,
+   *     uncompressed definition levels and the compressed column values.
+   *     If no compression scheme is specified, the CRC shall be calculated on
+   *     the uncompressed concatenation.
+   * If enabled, this allows for disabling checksumming in HDFS if only a few
+   * pages need to be read.
    **/
   4: optional i32 crc
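
Note on the rule in the `crc` comment: the checksum covers the page body bytes exactly as they follow the page header, in the order repetition levels, definition levels, column values. The sketch below illustrates this in Python under stated assumptions. It is not taken from any Parquet implementation: `page_crc` and `to_thrift_i32` are hypothetical helper names, `zlib.compress` stands in for whatever compression codec the column actually uses, and the level and value bytes are made-up placeholders rather than real encoded data.

```python
import zlib


def page_crc(rep_levels: bytes, def_levels: bytes, values: bytes) -> int:
    """CRC32 over rep levels, def levels and values, in that order.

    The page header is never passed in, so it is excluded by construction.
    Feeding the parts incrementally gives the same result as hashing their
    concatenation.
    """
    crc = zlib.crc32(rep_levels)
    crc = zlib.crc32(def_levels, crc)
    crc = zlib.crc32(values, crc)
    return crc


def to_thrift_i32(crc: int) -> int:
    """Reinterpret an unsigned 32-bit CRC as a signed value.

    Assumption: the writer stores the CRC in the signed `i32` field by
    keeping the same 32 bits.
    """
    return crc - (1 << 32) if crc >= (1 << 31) else crc


# Placeholder bytes standing in for already-encoded page sections.
rep = b"\x01\x00"
dfn = b"\x01\x01"
vals = b"hello parquet"

# v1 page with a codec: levels and values are compressed together,
# so the CRC is taken over that single compressed blob.
v1_body = zlib.compress(rep + dfn + vals)
v1_crc = page_crc(b"", b"", v1_body)

# v2 page with a codec: levels stay uncompressed, only values are compressed.
v2_crc = page_crc(rep, dfn, zlib.compress(vals))

# No codec (v1 or v2): CRC over the uncompressed concatenation.
plain_crc = page_crc(rep, dfn, vals)

print(to_thrift_i32(v1_crc), to_thrift_i32(v2_crc), to_thrift_i32(plain_crc))
```

A reader verifying a page would recompute the same value from the bytes it read after the header and compare it with the stored field, converting one side so that both use the same signedness.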