@bbraams (Contributor) commented Feb 24, 2019

Although a page-level CRC field is defined in the Thrift specification, currently neither parquet-cpp nor parquet-mr leverages it. Moreover, the comment in the Thrift specification reads ‘32bit crc for the data below’, which is ambiguous as to what exactly constitutes the ‘data’ that the checksum should be calculated on. To ensure backward and cross-compatibility of Parquet readers/writers that do want to leverage the CRC checksums, the format should specify exactly how and on what data the checksum should be calculated.

TL;DR proposal: the checksum will be calculated using the standard CRC32 algorithm, on the page data only, not including the page header itself (simple implementation), and on the compressed data (inherently faster, likely better triaging; a writer-side sketch follows the pros/cons lists below).

Alternatives

There are three main choices to be made here:

  1. Which variant of CRC32 to use
  2. Whether to include the page header itself in the checksum calculation
  3. Whether to calculate the checksum on uncompressed or compressed data

Algorithm

The CRC field holds a 32-bit value. There are many different variants of the original CRC32 algorithm, each producing different values for the same input. For ease of implementation we propose to use the standard CRC32 algorithm.
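
For illustration, here is a minimal sketch of that choice, assuming the "standard CRC32" is the variant implemented by zlib's crc32() (the same one used by gzip and java.util.zip.CRC32); the function and buffer names are hypothetical:

#include <zlib.h>

#include <cstdint>
#include <vector>

// Hypothetical helper: standard CRC32 over a page's serialized bytes.
uint32_t PageCrc32(const std::vector<uint8_t>& page_bytes) {
    uLong crc = crc32(0L, Z_NULL, 0);  // zlib's initial CRC value
    crc = crc32(crc, page_bytes.data(), static_cast<uInt>(page_bytes.size()));
    return static_cast<uint32_t>(crc);
}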

Including page header

The page header itself could be included in the checksum calculation using an approach similar to TCP's, whereby the checksum field itself is zeroed out before calculating the checksum that will be inserted there. Including the page header is better in the sense that it increases the data covered by the checksum. However, from an implementation perspective, not including it is easier. Furthermore, given the relatively small size of the page header compared to the page itself, not including it is likely good enough.
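
For comparison, a sketch of the TCP-style alternative described above (not the approach proposed here), assuming the CRC field is zeroed in the serialized header before checksumming; all names are hypothetical:

#include <zlib.h>

#include <cstdint>
#include <vector>

// Hypothetical header-inclusive variant: header_bytes is the serialized page
// header with its crc field set to zero.
uint32_t HeaderInclusiveCrc32(const std::vector<uint8_t>& header_bytes,
                              const std::vector<uint8_t>& page_bytes) {
    uLong crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, header_bytes.data(), static_cast<uInt>(header_bytes.size()));
    crc = crc32(crc, page_bytes.data(), static_cast<uInt>(page_bytes.size()));
    return static_cast<uint32_t>(crc);
}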

Compressed vs uncompressed

Compressed
Pros

  • Inherently faster, less data to operate on
  • Potentially better triaging when determining where a corruption was introduced, as the checksum is calculated at a later stage

Cons

  • We have to trust both the encoding stage and the compression stage

Uncompressed
Pros

  • We only have to trust the encoding stage
  • Possibly able to detect more corruptions: since the data is checksummed at the earliest possible moment, the checksum is sensitive to corruption introduced at any later stage

Cons

  • Inherently slower, more data to operate on, and verification always requires decompressing first
  • Potentially harder triaging, as there are more stages in which corruption could have been introduced
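
As referenced in the TL;DR above, here is a writer-side sketch of the proposed compressed option, using zlib's compress2() as a stand-in codec; the function and variable names are hypothetical:

#include <zlib.h>

#include <cstdint>
#include <vector>

// Hypothetical writer flow: compress the encoded page bytes, then checksum
// the compressed result, so the CRC covers exactly the bytes written to disk
// after the page header.
bool CompressThenChecksum(const std::vector<uint8_t>& encoded,
                          std::vector<uint8_t>* compressed, uint32_t* crc_out) {
    uLongf dest_len = compressBound(static_cast<uLong>(encoded.size()));
    compressed->resize(dest_len);
    if (compress2(compressed->data(), &dest_len, encoded.data(),
                  static_cast<uLong>(encoded.size()),
                  Z_DEFAULT_COMPRESSION) != Z_OK) {
        return false;
    }
    compressed->resize(dest_len);  // shrink to the actual compressed size
    uLong crc = crc32(0L, Z_NULL, 0);
    *crc_out = static_cast<uint32_t>(
        crc32(crc, compressed->data(), static_cast<uInt>(compressed->size())));
    return true;
}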

@pitrou (Member) commented Dec 13, 2022

@bbraams @gszadovszky Could you explain why the spec's wording is so complex?

It seems to me that the CRC is basically computed over the entire serialized data exactly as it's written to disk (after optional compression and encryption, and including the rep/def levels area that's prepended to the actual data). But the wording makes it seem that special care is needed to accumulate the CRC over different pieces of data, which may scare implementors (context: apache/arrow#14351).

Am I right in my interpretation above?
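
For reference, no special accumulation logic should be needed either way: CRC32 can be updated piecewise, so checksumming the pieces in order (e.g. rep levels, def levels, column values) gives the same result as a single pass over their concatenation. A sketch using zlib, with illustrative names:

#include <zlib.h>

#include <cstdint>
#include <vector>

// Accumulating the CRC over consecutive pieces equals one pass over their
// concatenation.
uint32_t CrcOfPieces(const std::vector<std::vector<uint8_t>>& pieces) {
    uLong crc = crc32(0L, Z_NULL, 0);
    for (const auto& piece : pieces) {
        crc = crc32(crc, piece.data(), static_cast<uInt>(piece.size()));
    }
    return static_cast<uint32_t>(crc);
}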

(also cc @mapleFU, who's working on CRC support for Parquet C++)

@mapleFU (Member) commented Dec 13, 2022

> (also cc @mapleFU, who's working on CRC support for Parquet C++)

Hi all, I have a question here. The format says:

  /** The 32bit CRC for the page, to be calculated as follows:
   * - Using the standard CRC32 algorithm
   * - On the data only, i.e. this header should not be included. 'Data'
   *   hereby refers to the concatenation of the repetition levels, the
   *   definition levels and the column value, in this exact order.
   * - On the encoded versions of the repetition levels, definition levels and
   *   column values
   * - On the compressed versions of the repetition levels, definition levels
   *   and column values where possible;
   *   - For v1 data pages, the repetition levels, definition levels and column
   *     values are always compressed together. If a compression scheme is
   *     specified, the CRC shall be calculated on the compressed version of
   *     this concatenation. If no compression scheme is specified, the CRC
   *     shall be calculated on the uncompressed version of this concatenation.
   *   - For v2 data pages, the repetition levels and definition levels are
   *     handled separately from the data and are never compressed (only
   *     encoded). If a compression scheme is specified, the CRC shall be
   *     calculated on the concatenation of the uncompressed repetition levels,
   *     uncompressed definition levels and the compressed column values.
   *     If no compression scheme is specified, the CRC shall be calculated on
   *     the uncompressed concatenation.
   * - In encrypted columns, CRC is calculated after page encryption; the
   *   encryption itself is performed after page compression (if compressed)
   * If enabled, this allows for disabling checksumming in HDFS if only a few
   * pages need to be read.
   **/

and in README:

Data pages can be individually checksummed. 

But in our code, we have:

int64_t WriteDictionaryPage(const DictionaryPage& page) override {
    // TODO(PARQUET-594) crc checksum
    ...
}

So, could a DICTIONARY_PAGE or even an INDEX_PAGE have a CRC? /cc @pitrou

@pitrou (Member) commented Dec 13, 2022

@mapleFU (Member) commented Dec 13, 2022

So, should we update parquet-format, or just keep it as-is and not implement this CRC in the Parquet C++ version? @pitrou

@pitrou (Member) commented Dec 13, 2022

It seems it was done deliberately in parquet-mr and all Parquet committers there agreed that it was how the spec should be interpreted: apache/parquet-java#647

So I would vote for doing it in Parquet C++. Can you first generate test files for CRCs of dictionary pages, similar to what you did for data pages?

@pitrou (Member) commented Dec 13, 2022

And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something.

@mapleFU (Member) commented Dec 13, 2022

> And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something.

OK, thanks for your patience. I updated the descriptions in https://issues.apache.org/jira/browse/ARROW-17904, and I will finish them in the coming two patches. Let's keep each patch simple and finish apache/arrow#14351 first.

@wgtmac (Member) commented Dec 13, 2022

> And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something.

Quick question: is there any rule for syncing the parquet.thrift file from apache/parquet-format to apache/arrow? @pitrou

@pitrou (Member) commented Dec 13, 2022

@wgtmac No particular rule, no. AFAIU we only synchronize when we want to get meaningful spec changes.

pitrou added a commit to pitrou/parquet-format that referenced this pull request Dec 13, 2022
When trying to implement CRC computation in Parquet C++, we found the wording to be ambiguous.

Clarify that CRC computation happens on the exact binary serialization (instead of a long-winded and confusing elaboration about v1 and v2 data page layout).

Also, clarify that CRC computation can apply to all page kinds, not only data pages
(for reference, parquet-mr currently supports checksumming v1 data pages as well as dictionary pages).

Also, see discussion on apache#126 (comment) and below.
@pitrou (Member) commented Dec 13, 2022

I opened #188 to clarify the wording.

pitrou added a commit that referenced this pull request Jan 3, 2023