Skip to content

Conversation

@mapleFU
Copy link
Member

@mapleFU mapleFU commented Oct 8, 2022

This patch add crc in writing and reading DATA_PAGE. And crc for dictionary, DATA_PAGE_V2 will be added in comming patches.

  • Implement crc in writing DATA_PAGE
  • Implement crc in reading DATA_PAGE
  • Adding config for write crc page and checking
  • Testing DATA_PAGE with crc, the testing maybe borrowed from parquet-mr
  • Using crc library in https://issues.apache.org/jira/browse/ARROW-17904

And there is some questions, I found that in thirdparty, arrow imports crc32c, which is extracted from leveldb's crc library. But seems that our standard uses crc32, which has a different magic number. So I vendor implementions mentioned in https://issues.apache.org/jira/browse/ARROW-17904 .

The default config of enable crc in parquet-mr for writer is true, but here I use false, because set it true may slow down writer.

@github-actions
Copy link

github-actions bot commented Oct 8, 2022

@github-actions
Copy link

github-actions bot commented Oct 8, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@pitrou
Copy link
Member

pitrou commented Oct 22, 2022

Thanks for the PR @mapleFU . I haven't looked in detail yet.

IMHO The testing approach should be to add reference files in https://github.com/apache/parquet-testing/tree/master/data .
Ideally these files should provide the given feature matrix:

  • correct and incorrect CRCs
  • v1 and v2 data pages
  • compressed and uncompressed columns

This would probably be spread over two files, one with correct CRCs, one with incorrect CRCs.
What do you think?

@mapleFU
Copy link
Member Author

mapleFU commented Oct 22, 2022

@pitrou
Copy link
Member

pitrou commented Oct 22, 2022

I'd like to follow the testing here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDataPageV1Checksums.java

This actually seems more complicated to implement (you have to replicate details of the file format in the tests).
Also, without reference files we wouldn't test that different implementations actually agree on the CRC computation.

@mapleFU
Copy link
Member Author

mapleFU commented Oct 22, 2022

This actually seems more complicated to implement (you have to replicate details of the file format in the tests). Also, without reference files we wouldn't test that different implementations actually agree on the CRC computation.

Sure, let me generate some data files in parquet-mr following TestDataPageV1Checksums first.

@mapleFU
Copy link
Member Author

mapleFU commented Oct 28, 2022

@pitrou @kou Hi, so should I put crc to vendored or util?

@pitrou
Copy link
Member

pitrou commented Oct 29, 2022

Since the code is simple and the algorithm well-know, my vote is to own it and keep it in util.

@kou
Copy link
Member

kou commented Oct 29, 2022

I like cpp/src/arrow/vendored/ but it's not a strong opinion. So I'm OK with util.

@pitrou
Copy link
Member

pitrou commented Dec 1, 2022

@mapleFU Is this still WIP?

@mapleFU
Copy link
Member Author

mapleFU commented Feb 2, 2023

Some comment fixed, test would be added later

@mapleFU
Copy link
Member Author

mapleFU commented Feb 5, 2023

I add a crc32 test which crc "123456789". But I found it hard to find some authority crc tests:

  1. boost crc only crc "123456789"
  2. folly crc uses boost crc to do some cross verification

So, I wonder where should I copy more crc tests? Or just test "123456789" is enough? @pitrou @kou

@pitrou
Copy link
Member

pitrou commented Feb 5, 2023

As I said before:

Also, we need unit tests for the CRC32 functionality using well-known test vectors (or generated from a well-known implementation, such as Python's).

@pitrou
Copy link
Member

pitrou commented Feb 5, 2023

And, yes, one or two test vectors is enough. But make sure to include bytes >= 128 (to check for signedness issues).

@mapleFU
Copy link
Member Author

mapleFU commented Feb 5, 2023

@mapleFU mapleFU force-pushed the crc32-for-data-page branch from 5d71c7d to 300b680 Compare February 5, 2023 13:46
@mapleFU
Copy link
Member Author

mapleFU commented Feb 5, 2023

@pitrou I decide to add boost crc32 for testing, though we don't introduce it in our code, but I think using it to test is a goot idea.

@mapleFU
Copy link
Member Author

mapleFU commented Feb 5, 2023

C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30133\include\random(1839): error C2338: invalid template argument for uniform_int_distribution: N4659 29.6.1.1 [rand.req.genl]/1e requires one of short, int, long, long long, unsigned short, unsigned int, unsigned long, or unsigned long long
D:/a/arrow/arrow/cpp/src/arrow/util/crc32_test.cc(55): note: see reference to class template instantiation 'std::uniform_int_distribution<uint8_t>' being compiled
C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30133\include\random(1839): error C2338: note: char, signed char, unsigned char, char8_t, int8_t, and uint8_t are not allowed

Seems windows cannot use random uint8_t...

@mapleFU mapleFU force-pushed the crc32-for-data-page branch from 2f167bb to 17e13bd Compare February 5, 2023 15:00
@mapleFU
Copy link
Member Author

mapleFU commented Feb 5, 2023

Oh, seems #34038 cause the error... Let's waiting it to be merged...
@pitrou @wjones127 and comments are fixed, PTAL again

@mapleFU mapleFU requested review from pitrou and wjones127 and removed request for pitrou and wjones127 February 7, 2023 10:41
@wjones127 wjones127 self-requested a review February 8, 2023 18:35
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. We use boost for testing in other places, so I think that's reasonable.

@wjones127 wjones127 merged commit 5a61849 into apache:master Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++] Parquet support read page with crc32 checking

7 participants