ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader #8548
Conversation
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.

___

This is a draft to implement an Arrow writer for Parquet. It supports the following (no complete test coverage yet):

* writing primitives, except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on pyarrow.

```jupyter
# schema of test_complex.parquet
a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels from first principles, based on an old Parquet post on the Twitter engineering blog. It's likely that I'm not getting some concepts right, so I would appreciate help with:

* Checking whether my logic is correct
* Guidance or suggestions on how to extract levels from arrays more efficiently
* Adding tests - I suspect we might need a lot of tests; so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file

I also don't know whether the various encodings (dictionary, RLE, etc.) and compression settings are applied automagically, or whether that's something we need to enable explicitly.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
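For readers unfamiliar with definition levels, here is a minimal, self-contained sketch of the idea referenced above. It is not the PR's actual code, and it assumes the simplest case: a flat, optional (nullable) column with a maximum definition level of 1, where each value gets level 1 if it is present and 0 if it is null; optional ancestors in nested types add one level each.

```rust
/// Sketch only: compute definition levels for a flat optional column
/// (max definition level = 1) from its validity (null) mask.
fn definition_levels(validity: &[bool]) -> Vec<i16> {
    validity
        .iter()
        .map(|is_valid| if *is_valid { 1 } else { 0 })
        .collect()
}

fn main() {
    // Column [Some(1), None, Some(3)] -> definition levels [1, 0, 1]
    let validity = [true, false, true];
    assert_eq!(definition_levels(&validity), vec![1, 0, 1]);
}
```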
This will allow preserving Arrow-specific metadata when writing or reading Parquet files created from C++ or Rust. If the schema can't be deserialised, the normal Parquet -> Arrow schema conversion is performed.

Closes #7917 from nevi-me/ARROW-8243

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
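As a rough sketch of the mechanism (assumptions: the Arrow schema is serialised as an IPC message, base64-encoded, and stored under the well-known `ARROW:SCHEMA` key in the Parquet footer's key-value metadata, following the C++ implementation; the `encoded_arrow_schema` helper below is hypothetical and just stands in for the real serialisation step):

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the real step, which serialises the Arrow schema
// as an IPC message and base64-encodes it.
fn encoded_arrow_schema() -> String {
    "<base64-encoded IPC schema message>".to_string()
}

fn main() {
    // Key-value metadata as stored in the Parquet file footer.
    let mut key_value_metadata: HashMap<String, String> = HashMap::new();
    key_value_metadata.insert("ARROW:SCHEMA".to_string(), encoded_arrow_schema());

    // On read: if the key is present and the payload deserialises, use the
    // embedded Arrow schema; otherwise fall back to Parquet -> Arrow conversion.
    match key_value_metadata.get("ARROW:SCHEMA") {
        Some(encoded) => println!("using embedded Arrow schema ({} bytes)", encoded.len()),
        None => println!("falling back to Parquet -> Arrow schema conversion"),
    }
}
```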
…arrow_schema with ipc changes

Note that this PR is deliberately filed against the rust-parquet-arrow-writer branch, not master!!

Hi! 👋 I'm looking to help out with the rust-parquet-arrow-writer branch, and I just pulled it down and it wasn't compiling, because in 75f804e `schema_to_bytes` was changed to take `IpcWriteOptions` and to return `EncodedData`. This updates `encode_arrow_schema` to use those changes, which should get this branch compiling and passing tests again.

I'm kind of guessing which JIRA ticket this should be associated with; honestly I think this commit can just be squashed with 8f0ed91 next time this branch gets rebased. Please let me know if I should change anything, I'm happy to!

Closes #8274 from carols10cents/update-with-ipc-changes

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
In this commit, I:

- Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change
- Uncommented the dictionary field from the Arrow schema roundtrip test and added a dictionary field to the IPC roundtrip test
- If a field is a dictionary field, called `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary], and described it with the same comment
- When getting the field type for a dictionary field, used the `value_type` as [the C++ code does][cpp-value-type], and described it with the same comment

The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
…r all supported Arrow DataTypes

Note that this PR goes to the rust-parquet-arrow-writer branch, not master.

Inspired by the tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc, these perform a round trip (Arrow -> Parquet -> Arrow) of a single RecordBatch with a single column of values for each of the supported data types, and some of the unsupported ones.

Tests that currently fail are either marked with `#[should_panic]` (if they fail because of a panic) or `#[ignore]` (if they fail because the values don't match).

I am comparing the RecordBatch's column data before and after the round trip directly; I'm not sure whether this is appropriate, because for some data types the `null_bitmap` isn't matching and I'm not sure if it's supposed to or not. So I would love advice on that front, and I would love to know whether these tests are useful or not!

Closes #8330 from carols10cents/roundtrip-tests

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
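As an illustration of the marking convention described above (a sketch only, not the actual test file; the test names and bodies are made up), a failing round-trip test stays in the suite but is annotated according to how it currently fails:

```rust
// Sketch of the convention: keep failing round-trip tests in the suite, but
// annotate them according to how they currently fail.
#[cfg(test)]
mod roundtrip_markers {
    #[test]
    #[should_panic] // the round trip currently panics for this DataType
    fn roundtrip_unsupported_type_panics() {
        panic!("writer does not support this Arrow DataType yet");
    }

    #[test]
    #[ignore] // the values don't match after the round trip yet
    fn roundtrip_values_do_not_match_yet() {
        // would compare the RecordBatch column data before and after the round trip
    }
}
```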
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet: it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here; is there a possibility for the code to be shared, or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Closes #8388 from nevi-me/ARROW-10225

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This allows writing an Arrow NullArray to Parquet. Support was added in Parquet a few years ago, and the C++ implementation supports writing null arrays. The array is stored as an int32 which has all values set as null.

In order to implement this, we introduce a `null -> int32` cast, which creates a null int32 array of the same length. Semantically, the write is the same as writing an int32 array that is all null, but we create a null writer to preserve the data type.

Closes #8484 from nevi-me/ARROW-10334

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
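To make the semantics above concrete, here is a minimal sketch (an illustration only, not the crate's code): writing a NullArray of length `n` is equivalent to writing an optional int32 column in which every definition level is 0, so no values are emitted at all.

```rust
// Sketch: a NullArray of length `len` behaves like an optional Int32 column in
// which every slot is null, i.e. all definition levels are 0 and there are no
// non-null values to write.
fn null_array_levels(len: usize) -> (Vec<i16>, Vec<i32>) {
    let def_levels = vec![0i16; len]; // every slot is null
    let values: Vec<i32> = Vec::new(); // nothing to write for the value pages
    (def_levels, values)
}

fn main() {
    let (defs, values) = null_array_levels(4);
    assert_eq!(defs, vec![0, 0, 0, 0]);
    assert!(values.is_empty());
}
```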
This is a port of #6770 to the parquet-writer branch. We'll have more of a chance to test this reader, and ensure that we can round-trip on list types.

Closes #8449 from nevi-me/ARROW-7842-cherry

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This adds more support for dictionaries:

- When converting Arrow -> Parquet and the data contains an Arrow Dictionary, materialize the Dictionary values and send them to Parquet to be encoded with a dictionary (or not) according to the Parquet settings. We deliberately do not support converting an Arrow Dictionary directly to Parquet DictEncoding, and right now this only supports String dictionaries.
- When converting Parquet -> Arrow, notice that the Arrow schema metadata in a Parquet file has a Dictionary type, and convert the data to an Arrow dictionary (right now this only supports String dictionaries).

I'm not sure if this is in a good enough state to merge or not yet, please let me know @nevi-me!

Closes #8402 from carols10cents/dict

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Jake Goulding <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
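As a rough illustration of what "materializing" a string dictionary means here (a self-contained sketch with made-up types, not this PR's code): the dictionary keys index into the dictionary values, and the expanded strings are what get handed to the Parquet column writer, which then applies its own dictionary encoding, or not, according to the writer properties.

```rust
// Sketch: expand a dictionary-encoded string column (keys + dictionary values)
// into the plain strings that the Parquet writer receives.
fn materialize_dictionary(keys: &[Option<usize>], values: &[&str]) -> Vec<Option<String>> {
    keys.iter()
        .map(|key| key.map(|i| values[i].to_string()))
        .collect()
}

fn main() {
    let values = ["foo", "bar"];
    let keys = [Some(0), Some(1), None, Some(0)];
    assert_eq!(
        materialize_dictionary(&keys, &values),
        vec![Some("foo".into()), Some("bar".into()), None, Some("foo".into())]
    );
}
```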
There was probably some misunderstanding during the call yesterday, but I was expecting this to be a pull request that was approved by another committer before being merged. This merge IMO did not follow our review practice and did not give time to other committers to go through it.
I am a bit lost here: I thought that we would still conduct a review of this code before a merge, as no one was reviewing while the branch was up.
IMO we introduced errors into the arrow crate, and there was no opportunity for others to review before this merge.
```rust
impl PartialEq for ArrayData {
    fn eq(&self, other: &Self) -> bool {
        assert_eq!(
            self.data_type(),
            other.data_type(),
            "Data types not the same"
        );
        assert_eq!(self.len(), other.len(), "Lengths not the same");
        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
        assert_eq!(self.null_count(), other.null_count());
        // compare buffers excluding padding
        let self_buffers = self.buffers();
        let other_buffers = other.buffers();
        assert_eq!(self_buffers.len(), other_buffers.len());
        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
            compare_buffer_regions(
                s,
                self.offset(), // TODO mul by data length
                o,
                other.offset(), // TODO mul by data len
            );
        });
        // assert_eq!(self.buffers(), other.buffers());

        assert_eq!(self.child_data(), other.child_data());
        // null arrays can skip the null bitmap, thus only compare if there are no nulls
        if self.null_count() != 0 || other.null_count() != 0 {
            compare_buffer_regions(
                self.null_buffer().unwrap(),
                self.offset(),
                other.null_buffer().unwrap(),
                other.offset(),
            )
        }
        true
    }
}

/// A helper to compare buffer regions of 2 buffers.
/// Compares the length of the shorter buffer.
fn compare_buffer_regions(
    left: &Buffer,
    left_offset: usize,
    right: &Buffer,
    right_offset: usize,
) {
    // for convenience, we assume that the buffer lengths are only unequal if one has padding,
    // so we take the shorter length so we can discard the padding from the longer length
    let shorter_len = left.len().min(right.len());
    let s_sliced = left.bit_slice(left_offset, shorter_len);
    let o_sliced = right.bit_slice(right_offset, shorter_len);
    assert_eq!(s_sliced, o_sliced);
}
```
IMO this is not correct, see this comment
Apologies Jorge, the intention was to bring the changes into master on the rebuttable presumption that they had already been reviewed. After creating the PR, I noticed that the other merge options had been disabled in the UI (we can only squash now), so the aim of being able to use the UI to merge this was defeated, and so I rebased and pushed from my end.

I've gone through your questions around ArrayData equality, so I'll have a look at what you've done on #8541, and I can prioritise any follow-up work needed to cover what might still be outstanding.
No worries, it was a miscommunication, and I am sorry for not being very explicit about it: I was aware of the writer branch, but since it was moving fast, I deemed it "WIP" and waited for a PR to master before going through it. Once I saw the PR from @carols10cents to master, I understood it to be the "ready to review" version of the branch, and reviewed it ASAP, given the clock ticking on the Parquet work and the call. Only afterwards did I realise that that PR was actually a different thing, and that the main changes had already been merged.

I would consider it beneficial to have a review on any push to master, from anyone. If anything, it gives time for others to comment on the PR and get accustomed to the changes.
That's okay, I understand.
Anyway, I do not want this to downplay any of the work that mostly you and @carols10cents have been doing: the changes look really great. Thanks a lot for taking this on. It is a monumental effort and the outcome is great. 💯 Overall, this brings the Rust implementation closer to its full potential of reading from and writing to Parquet.
This merges the in-progress Arrow <> Parquet IO branch into master, so we can continue with development here.
Please do not squash, as we'd like to preserve the individual commits as they're already squashed into their respective JIRAs.