-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-33115: [C++] Parquet Implement crc in reading and writing Page for DATA_PAGE (v1) #14351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
Thanks for the PR @mapleFU . I haven't looked in detail yet. IMHO The testing approach should be to add reference files in https://github.com/apache/parquet-testing/tree/master/data .
This would probably be spread over two files, one with correct CRCs, one with incorrect CRCs. |
This actually seems more complicated to implement (you have to replicate details of the file format in the tests). |
Sure, let me generate some data files in parquet-mr following |
|
Since the code is simple and the algorithm well-know, my vote is to own it and keep it in |
|
I like |
|
@mapleFU Is this still WIP? |
# Conflicts: # docs/source/cpp/parquet.rst
|
Some comment fixed, test would be added later |
|
As I said before:
|
|
And, yes, one or two test vectors is enough. But make sure to include bytes >= 128 (to check for signedness issues). |
|
https://github.com/python/cpython/blob/3ceb4b8d3ad48101e67da8fb8f581d3377863bfa/Lib/test/test_binascii.py#L237-L242 |
5d71c7d to
300b680
Compare
|
@pitrou I decide to add boost crc32 for testing, though we don't introduce it in our code, but I think using it to test is a goot idea. |
Seems windows cannot use random uint8_t... |
2f167bb to
17e13bd
Compare
|
Oh, seems #34038 cause the error... Let's waiting it to be merged... |
wjones127
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. We use boost for testing in other places, so I think that's reasonable.
This patch add crc in writing and reading DATA_PAGE. And crc for dictionary, DATA_PAGE_V2 will be added in comming patches.
parquet-mrAnd there is some questions, I found that in thirdparty, arrow imports
crc32c, which is extracted from leveldb's crc library. But seems that our standard uses crc32, which has a different magic number. So I vendor implementions mentioned in https://issues.apache.org/jira/browse/ARROW-17904 .The default config of
enable crcin parquet-mr for writer is true, but here I usefalse, because set it true may slow down writer.