@bbraams (Contributor) commented Feb 24, 2019

Although a page-level CRC field is defined in the Thrift specification, currently neither parquet-cpp nor parquet-mr leverages it. Moreover, the comment in the Thrift specification reads ‘32bit crc for the data below’, which is ambiguous as to what exactly constitutes the ‘data’ that the checksum should be calculated on. To ensure backward and cross-compatibility of Parquet readers/writers that do want to leverage the CRC checksums, the format should specify exactly how and on what data the checksum should be calculated.

TL;DR proposal: the checksum will be calculated using the standard CRC32 algorithm, on the page data only, not including the page header itself (simple implementation), and on the compressed data (inherently faster, likely better triaging; a writer-side sketch follows the pros/cons lists below).

Alternatives

There are three main choices to be made here:

  1. Which variant of CRC32 to use
  2. Whether to include the page header itself in the checksum calculation
  3. Whether to calculate the checksum on uncompressed or compressed data

Algorithm

The CRC field holds a 32-bit value. There are many different variants of the original CRC32 algorithm, each producing different values for the same input. For ease of implementation we propose to use the standard CRC32 algorithm.
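
For illustration, here is a minimal sketch of that choice, assuming the "standard CRC32" is the variant implemented by zlib's crc32() (the same one used by gzip and java.util.zip.CRC32); the function and buffer names are hypothetical:

#include <zlib.h>

#include <cstdint>
#include <vector>

// Hypothetical helper: standard CRC32 over a page's serialized bytes.
uint32_t PageCrc32(const std::vector<uint8_t>& page_bytes) {
    uLong crc = crc32(0L, Z_NULL, 0);  // zlib's initial CRC value
    crc = crc32(crc, page_bytes.data(), static_cast<uInt>(page_bytes.size()));
    return static_cast<uint32_t>(crc);
}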

Including page header

The page header itself could be included in the checksum calculation using an approach similar to TCP's, whereby the checksum field itself is zeroed out before calculating the checksum that will be inserted there. Including the page header is better in the sense that it increases the data covered by the checksum. However, from an implementation perspective, not including it is easier. Furthermore, given the relatively small size of the page header compared to the page itself, not including it is likely good enough.
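
For comparison, a sketch of the TCP-style alternative described above (not the approach proposed here), assuming the CRC field is zeroed in the serialized header before checksumming; all names are hypothetical:

#include <zlib.h>

#include <cstdint>
#include <vector>

// Hypothetical header-inclusive variant: header_bytes is the serialized page
// header with its crc field set to zero.
uint32_t HeaderInclusiveCrc32(const std::vector<uint8_t>& header_bytes,
                              const std::vector<uint8_t>& page_bytes) {
    uLong crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, header_bytes.data(), static_cast<uInt>(header_bytes.size()));
    crc = crc32(crc, page_bytes.data(), static_cast<uInt>(page_bytes.size()));
    return static_cast<uint32_t>(crc);
}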

Compressed vs uncompressed

Compressed
Pros

  • Inherently faster, less data to operate on
  • Potentially better triaging when determining where a corruption was introduced, as the checksum is calculated at a later stage

Cons

  • We have to trust both the encoding stage and the compression stage

Uncompressed
Pros

  • We only have to trust the encoding stage
  • Possibly able to detect more corruptions: since the data is checksummed at the earliest possible moment, the checksum is sensitive to corruption introduced at any later stage

Cons

  • Inherently slower, more data to operate on, and verification always requires decompressing first
  • Potentially harder triaging, as there are more stages in which corruption could have been introduced
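
As referenced in the TL;DR above, here is a writer-side sketch of the proposed compressed option, using zlib's compress2() as a stand-in codec; the function and variable names are hypothetical:

#include <zlib.h>

#include <cstdint>
#include <vector>

// Hypothetical writer flow: compress the encoded page bytes, then checksum
// the compressed result, so the CRC covers exactly the bytes written to disk
// after the page header.
bool CompressThenChecksum(const std::vector<uint8_t>& encoded,
                          std::vector<uint8_t>* compressed, uint32_t* crc_out) {
    uLongf dest_len = compressBound(static_cast<uLong>(encoded.size()));
    compressed->resize(dest_len);
    if (compress2(compressed->data(), &dest_len, encoded.data(),
                  static_cast<uLong>(encoded.size()),
                  Z_DEFAULT_COMPRESSION) != Z_OK) {
        return false;
    }
    compressed->resize(dest_len);  // shrink to the actual compressed size
    uLong crc = crc32(0L, Z_NULL, 0);
    *crc_out = static_cast<uint32_t>(
        crc32(crc, compressed->data(), static_cast<uInt>(compressed->size())));
    return true;
}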

@pitrou (Member) commented Dec 13, 2022

@bbraams @gszadovszky Could you explain why the spec's wording is so complex?

It seems to me that the CRC is basically computed over the entire serialized data exactly as it's written to disk (after optional compression and encryption, and including the rep/def levels area that's prepended to the actual data). But the wording makes it seem that special care is needed to accumulate the CRC over different pieces of data, which may scare implementors (context: apache/arrow#14351).

Am I right in my interpretation above?
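
For reference, no special accumulation logic should be needed either way: CRC32 can be updated piecewise, so checksumming the pieces in order (e.g. rep levels, def levels, column values) gives the same result as a single pass over their concatenation. A sketch using zlib, with illustrative names:

#include <zlib.h>

#include <cstdint>
#include <vector>

// Accumulating the CRC over consecutive pieces equals one pass over their
// concatenation.
uint32_t CrcOfPieces(const std::vector<std::vector<uint8_t>>& pieces) {
    uLong crc = crc32(0L, Z_NULL, 0);
    for (const auto& piece : pieces) {
        crc = crc32(crc, piece.data(), static_cast<uInt>(piece.size()));
    }
    return static_cast<uint32_t>(crc);
}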

(also cc @mapleFU, who's working on CRC support for Parquet C++)

@mapleFU (Member) commented Dec 13, 2022

> (also cc @mapleFU, who's working on CRC support for Parquet C++)

Hi all, I have a question here. The format says:

  /** The 32bit CRC for the page, to be calculated as follows:
   * - Using the standard CRC32 algorithm
   * - On the data only, i.e. this header should not be included. 'Data'
   *   hereby refers to the concatenation of the repetition levels, the
   *   definition levels and the column value, in this exact order.
   * - On the encoded versions of the repetition levels, definition levels and
   *   column values
   * - On the compressed versions of the repetition levels, definition levels
   *   and column values where possible;
   *   - For v1 data pages, the repetition levels, definition levels and column
   *     values are always compressed together. If a compression scheme is
   *     specified, the CRC shall be calculated on the compressed version of
   *     this concatenation. If no compression scheme is specified, the CRC
   *     shall be calculated on the uncompressed version of this concatenation.
   *   - For v2 data pages, the repetition levels and definition levels are
   *     handled separately from the data and are never compressed (only
   *     encoded). If a compression scheme is specified, the CRC shall be
   *     calculated on the concatenation of the uncompressed repetition levels,
   *     uncompressed definition levels and the compressed column values.
   *     If no compression scheme is specified, the CRC shall be calculated on
   *     the uncompressed concatenation.
   * - In encrypted columns, CRC is calculated after page encryption; the
   *   encryption itself is performed after page compression (if compressed)
   * If enabled, this allows for disabling checksumming in HDFS if only a few
   * pages need to be read.
   **/

and in README:

Data pages can be individually checksummed. 

But in our code, we have:

int64_t WriteDictionaryPage(const DictionaryPage& page) override {
    // TODO(PARQUET-594) crc checksum
    ...
}

So, could a DICTIONARY_PAGE or even an INDEX_PAGE have a CRC? /cc @pitrou

@pitrou (Member) commented Dec 13, 2022

@mapleFU (Member) commented Dec 13, 2022

So, should we update parquet-format, or just keep it as-is and not implement this CRC in the Parquet C++ version? @pitrou

@pitrou (Member) commented Dec 13, 2022

It seems it was done deliberately in parquet-mr and all Parquet committers there agreed that it was how the spec should be interpreted: apache/parquet-java#647

So I would vote for doing it in Parquet C++. Can you first generate test files for CRCs of dictionary pages, similar to what you did for data pages?

@pitrou (Member) commented Dec 13, 2022

And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something.

@mapleFU (Member) commented Dec 13, 2022

> And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something.

OK, thanks for your patience. I updated the descriptions in https://issues.apache.org/jira/browse/ARROW-17904, and I will finish them in the coming two patches. Let's keep each patch simple and finish apache/arrow#14351 first.

@wgtmac (Member) commented Dec 13, 2022

> And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something.

Quick question: is there any rule for syncing the parquet.thrift file from apache/parquet-format to apache/arrow? @pitrou

@pitrou (Member) commented Dec 13, 2022

@wgtmac No particular rule, no. AFAIU we only synchronize when we want to get meaningful spec changes.

pitrou added a commit to pitrou/parquet-format that referenced this pull request Dec 13, 2022
When trying to implement CRC computation in Parquet C++, we found the wording to be ambiguous.

Clarify that CRC computation happens on the exact binary serialization (instead of a long-winded and confusing elaboration about v1 and v2 data page layout).

Also, clarify that CRC computation can apply to all page kinds, not only data pages
(for reference, parquet-mr currently supports checksumming v1 data pages as well as dictionary pages).

Also, see discussion on apache#126 (comment) and below.
@pitrou (Member) commented Dec 13, 2022

I opened #188 to clarify the wording.

pitrou added a commit that referenced this pull request Jan 3, 2023