apache · pitrou · Jan 3, 2023 · Dec 13, 2022
diff --git a/README.md b/README.md
@@ -239,10 +239,10 @@ skip pages more efficiently. See [PageIndex.md](PageIndex.md) for details and
 the reasoning behind adding these to the format.
 
 ## Checksumming
-Data pages can be individually checksummed.  This allows disabling of checksums at the
-HDFS file level, to better support single row lookups. Data page checksums are calculated
-using the standard CRC32 algorithm on the compressed data of a page (not including the
-page header itself).
+Pages of all kinds can be individually checksummed. This allows disabling of checksums
+at the HDFS file level, to better support single row lookups. Checksums are calculated
+using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary
+representation of a page (not including the page header itself).
 
 ## Error recovery
 If the file metadata is corrupt, the file is lost.  If the column metadata is corrupt,

diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
@@ -639,32 +639,23 @@ struct PageHeader {
   /** Compressed (and potentially encrypted) page size in bytes, not including this header **/
   3: required i32 compressed_page_size
 
-  /** The 32bit CRC for the page, to be be calculated as follows:
-   * - Using the standard CRC32 algorithm
-   * - On the data only, i.e. this header should not be included. 'Data'
-   *   hereby refers to the concatenation of the repetition levels, the
-   *   definition levels and the column value, in this exact order.
-   * - On the encoded versions of the repetition levels, definition levels and
-   *   column values
-   * - On the compressed versions of the repetition levels, definition levels
-   *   and column values where possible;
-   *   - For v1 data pages, the repetition levels, definition levels and column
-   *     values are always compressed together. If a compression scheme is
-   *     specified, the CRC shall be calculated on the compressed version of
-   *     this concatenation. If no compression scheme is specified, the CRC
-   *     shall be calculated on the uncompressed version of this concatenation.
-   *   - For v2 data pages, the repetition levels and definition levels are
-   *     handled separately from the data and are never compressed (only
-   *     encoded). If a compression scheme is specified, the CRC shall be
-   *     calculated on the concatenation of the uncompressed repetition levels,
-   *     uncompressed definition levels and the compressed column values.
-   *     If no compression scheme is specified, the CRC shall be calculated on
-   *     the uncompressed concatenation.
-   * - In encrypted columns, CRC is calculated after page encryption; the
-   *   encryption itself is performed after page compression (if compressed)
+  /** The 32-bit CRC checksum for the page, to be calculated as follows:
+   *
+   * - The standard CRC32 algorithm is used (with polynomial 0x04C11DB7,
+   *   the same as in e.g. GZip).
+   * - All page types can have a CRC (v1 and v2 data pages, dictionary pages,
+   *   etc.).
+   * - The CRC is computed on the serialized binary representation of the page
+   *   (as written to disk), excluding the page header. For example, for v1
+   *   data pages, the CRC is computed on the concatenation of repetition levels,
+   *   definition levels and column values (optionally compressed, optionally
+   *   encrypted).
+   * - The CRC computation therefore takes place after the compression
+   *   and encryption steps, if any.
+   *
    * If enabled, this allows for disabling checksumming in HDFS if only a few
    * pages need to be read.
-   **/
+   */
   4: optional i32 crc
 
   // Headers for page specific data.  One only will be set.
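
For reviewers who want to sanity-check the wording above, here is a minimal sketch of the CRC computation in Python. It assumes that `zlib.crc32` implements the same standard CRC32 as GZip, and that a writer stores the unsigned result in the signed `i32 crc` field by reinterpreting its bits. The function names `page_crc32`, `to_thrift_i32` and `verify_page` are hypothetical and not part of any Parquet library.

```python
import zlib


def page_crc32(page_bytes: bytes) -> int:
    """Unsigned CRC32 of the page bytes as written to disk.

    `page_bytes` is assumed to be the serialized page body (after any
    compression and encryption), with the page header excluded.
    """
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def to_thrift_i32(crc: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed i32.

    Assumption: the `crc` field is a Thrift i32, so values with the top bit
    set are stored as negative numbers.
    """
    return crc - (1 << 32) if crc >= (1 << 31) else crc


def verify_page(page_bytes: bytes, stored_crc: int) -> bool:
    """Compare a stored (signed) CRC against the recomputed one."""
    return page_crc32(page_bytes) == (stored_crc & 0xFFFFFFFF)


if __name__ == "__main__":
    body = b"123456789"  # stand-in for a serialized page body
    stored = to_thrift_i32(page_crc32(body))
    print(hex(page_crc32(body)), stored, verify_page(body, stored))
```

Because the checksum covers the bytes exactly as written, the same routine would apply to every page type without needing to know how levels and values are laid out inside the page.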