-
Notifications
You must be signed in to change notification settings - Fork 462
PARQUET-1539: Clarify CRC checksum in page header #126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@bbraams @gszadovszky Could you explain why the spec's wording is so complex? It seems to me that the CRC is basically computed over the entire serialized data exactly as it's written to disk (after optional compression and encryption, and including the rep/def levels area that's prepended to the actual data). But the wording makes it seem that special care is needed to accumulate the CRC over different pieces of data, which may scare implementors (context: apache/arrow#14351 ). Am I right in my interpretation above? (also cc @mapleFU, who's working on CRC support for Parquet C++) |
Hi, all, I have a question here, the format says: and in But in our coding, we have: int64_t WriteDictionaryPage(const DictionaryPage& page) override {
// TODO(PARQUET-594) crc checksum
...
}So, could DICTIONARY_PAGE or even INDEX_PAGE have crc? /cc @pitrou |
|
It does seem that parquet-mr writes a CRC value for dictionary pages... |
|
So, should we update the |
|
It seems it was done deliberately in parquet-mr and all Parquet committers there agreed that it was how the spec should be interpreted: apache/parquet-java#647 So I would vote for doing it in Parquet C++. Can you first generate test files for CRCs of dictionary pages, similar to what you did for data pages? |
|
And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. |
OK, thanks for your patient. I updated the descriptions in https://issues.apache.org/jira/browse/ARROW-17904 . And I will finish them in the coming 2 patches, let's keep the patch simple and finish apache/arrow#14351 first |
Quick question: is there any rule to sync the |
|
@wgtmac No particular rule, no. AFAIU we only synchronize when we want to get meaningful spec changes. |
When trying to implement CRC computation in Parquet C++, we found the wording to be ambiguous. Clarify that CRC computation happens on the exact binary serialization (instead of a long-winded and confusing elaboration about v1 and v2 data page layout). Also, clarify that CRC computation can apply to all page kinds, not only data pages (for reference, parquet-mr currently support checksumming v1 data pages as well as dictionary pages). Also, see discussion on apache#126 (comment) and below.
|
I opened #188 to clarify the wording. |
When trying to implement CRC computation in Parquet C++, we found the wording to be ambiguous. Clarify that CRC computation happens on the exact binary serialization (instead of a long-winded and confusing elaboration about v1 and v2 data page layout). Also, clarify that CRC computation can apply to all page kinds, not only data pages (for reference, parquet-mr currently support checksumming v1 data pages as well as dictionary pages). Also, see discussion on #126 (comment) and below.
Although a page-level CRC field is defined in the Thrift specification, currently neither parquet-cpp nor parquet-mr leverage it. Moreover, the comment in the Thrift specification reads ‘32bit crc for the data below’, which is somewhat ambiguous to what exactly constitutes the ‘data’ that the checksum should be calculated on. To ensure backward- and cross-compatibility of Parquet readers/writes which do want to leverage the CRC checksums, the format should specify exactly how and on what data the checksum should be calculated.
TLDR proposal: The checksum will be calculated using the standard CRC32 algorithm, whereby the checksum is to be calculated on the data only, not including the page header itself (simple implementation) and the checksum will be calculated on compressed data (inherently faster, likely better triaging).
Alternatives
There are three main choices to be made here:
Algorithm
The CRC field holds a 32-bit value. There are many different variants of the original CRC32 algorithm, each producing different values for the same input. For ease of implementation we propose to use the standard CRC32 algorithm.
Including page header
The page header itself could be included in the checksum calculation using an approach similar to what TCP does, whereby the checksum field itself is zeroed out before calculating the checksum that will be inserted there. Evidently, including the page header is better in the sense that it increases the data covered by the checksum. However, from an implementation perspective, not including it is likely easier. Furthermore, given the relatively small size of the page header compared to the page itself, simply not including it will likely be good enough.
Compressed vs uncompressed
Compressed
Pros
Cons
Uncompressed
Pros
Cons