diff --git a/README.md b/README.md
index 99b05468b..d0f654f7f 100644
--- a/README.md
+++ b/README.md
@@ -239,10 +239,10 @@ skip pages more efficiently. See [PageIndex.md](PageIndex.md) for details and
 the reasoning behind adding these to the format.
 
 ## Checksumming
-Data pages can be individually checksummed. This allows disabling of checksums at the
-HDFS file level, to better support single row lookups. Data page checksums are calculated
-using the standard CRC32 algorithm on the compressed data of a page (not including the
-page header itself).
+Pages of all kinds can be individually checksummed. This allows disabling of checksums
+at the HDFS file level, to better support single row lookups. Checksums are calculated
+using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary
+representation of a page (not including the page header itself).
 
 ## Error recovery
 If the file metadata is corrupt, the file is lost. If the column metadata is corrupt,
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 8c4ddd0cd..54beb4771 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -639,32 +639,23 @@ struct PageHeader {
   /** Compressed (and potentially encrypted) page size in bytes, not including this header **/
   3: required i32 compressed_page_size
 
-  /** The 32bit CRC for the page, to be be calculated as follows:
-   * - Using the standard CRC32 algorithm
-   * - On the data only, i.e. this header should not be included. 'Data'
-   *   hereby refers to the concatenation of the repetition levels, the
-   *   definition levels and the column value, in this exact order.
-   * - On the encoded versions of the repetition levels, definition levels and
-   *   column values
-   * - On the compressed versions of the repetition levels, definition levels
-   *   and column values where possible;
-   *   - For v1 data pages, the repetition levels, definition levels and column
-   *     values are always compressed together. If a compression scheme is
-   *     specified, the CRC shall be calculated on the compressed version of
-   *     this concatenation. If no compression scheme is specified, the CRC
-   *     shall be calculated on the uncompressed version of this concatenation.
-   *   - For v2 data pages, the repetition levels and definition levels are
-   *     handled separately from the data and are never compressed (only
-   *     encoded). If a compression scheme is specified, the CRC shall be
-   *     calculated on the concatenation of the uncompressed repetition levels,
-   *     uncompressed definition levels and the compressed column values.
-   *     If no compression scheme is specified, the CRC shall be calculated on
-   *     the uncompressed concatenation.
-   * - In encrypted columns, CRC is calculated after page encryption; the
-   *   encryption itself is performed after page compression (if compressed)
+  /** The 32-bit CRC checksum for the page, to be be calculated as follows:
+   *
+   * - The standard CRC32 algorithm is used (with polynomial 0x04C11DB7,
+   *   the same as in e.g. GZip).
+   * - All page types can have a CRC (v1 and v2 data pages, dictionary pages,
+   *   etc.).
+   * - The CRC is computed on the serialization binary representation of the page
+   *   (as written to disk), excluding the page header. For example, for v1
+   *   data pages, the CRC is computed on the concatenation of repetition levels,
+   *   definition levels and column values (optionally compressed, optionally
+   *   encrypted).
+   * - The CRC computation therefore takes place after any compression
+   *   and encryption steps, if any.
+   *
    * If enabled, this allows for disabling checksumming in HDFS if only a few
    * pages need to be read.
-   **/
+   */
   4: optional i32 crc
 
   // Headers for page specific data. One only will be set.
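
The rule in the new comment (CRC32 over the page bytes as written to disk, header excluded, after any compression and encryption) can be illustrated with a minimal Python sketch. It relies on `zlib.crc32` from the standard library, which implements the same CRC-32 as GZip. The helper names `page_crc`, `to_thrift_i32` and `verify_page` are hypothetical and not part of any Parquet library; folding the unsigned result into a signed 32-bit value is an assumption about how a writer might fit it into the `i32` `crc` field.

```python
import struct
import zlib


def page_crc(page_body: bytes) -> int:
    """CRC32 of the page bytes exactly as written to disk.

    `page_body` is everything that follows the page header, i.e. the output
    of any compression and encryption steps. The header is not included.
    """
    return zlib.crc32(page_body) & 0xFFFFFFFF


def to_thrift_i32(crc: int) -> int:
    """Reinterpret an unsigned 32-bit CRC as a signed 32-bit integer.

    Assumption: the Thrift `i32` type is signed, so a writer has to store
    the same 32 bits as a possibly negative number.
    """
    return struct.unpack("<i", struct.pack("<I", crc))[0]


def verify_page(page_body: bytes, header_crc: int) -> bool:
    """Compare the stored `crc` field against a freshly computed checksum."""
    return page_crc(page_body) == (header_crc & 0xFFFFFFFF)


if __name__ == "__main__":
    # Stand-in bytes for a serialized page; real pages would hold the
    # concatenated levels and values, optionally compressed and encrypted.
    body = b"123456789"
    stored = to_thrift_i32(page_crc(body))
    print(hex(page_crc(body)), stored, verify_page(body, stored))
```

A reader following this sketch would checksum the bytes before decrypting or decompressing them, so a corrupt page can be rejected without running either step.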