Conversation

@nevi-me (Contributor) commented Oct 28, 2020

This merges the in-progress Arrow <> Parquet IO branch into master, so we can continue with development here.

Please do not squash, as we'd like to preserve the individual commits as they're already squashed into their respective JIRAs.

nevi-me and others added 10 commits October 28, 2020 20:24
**Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a single test, which creates a nested schema that I verified with pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```
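
For orientation, here is a minimal sketch of how I'd expect the writer to be used (hedged: I'm assuming it is exposed as `parquet::arrow::arrow_writer::ArrowWriter` with `try_new` / `write` / `close`, and the file name is just illustrative):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // a: non-nullable int32, b: nullable int32 (its nulls become definition levels)
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, false),
        Field::new("b", DataType::Int32, true),
    ]));
    let a = Int32Array::from(vec![1, 2, 3]);
    let b = Int32Array::from(vec![Some(10), None, Some(30)]);
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(a), Arc::new(b)])?;

    // write a single batch with default writer properties
    let file = File::create("test_simple.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```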

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels from first principles, based on an old Parquet post from the Twitter engineering blog (see the worked example after the list below). It's likely that I'm not getting some concepts right, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from arrays
* Adding tests - I suspect we might need a lot of them; so far we only test writing 1 batch, so I don't know how paging would behave when writing a large enough file
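
As a worked example of what I mean (my own illustration, not code from this branch): take an optional struct `c` with an optional `double` child `d` and no list types.

```rust
// rows:                   {c: {d: 1.0}}, {c: {d: null}}, {c: null}
// values actually stored: [1.0]       (only fully-defined leaf values are written)
// definition levels:      [2, 1, 0]   (2 = d present, 1 = c present but d null, 0 = c null)
// repetition levels:      [0, 0, 0]   (always 0 when there are no repeated/list fields)
fn main() {
    let values: Vec<f64> = vec![1.0];
    let def_levels: Vec<i16> = vec![2, 1, 0];
    let rep_levels: Vec<i16> = vec![0, 0, 0];

    // one level pair per row, one stored value per max-definition-level entry
    assert_eq!(def_levels.len(), rep_levels.len());
    assert_eq!(
        values.len(),
        def_levels.iter().filter(|&&d| d == 2).count()
    );
}
```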

I also don't know if the various encodings (dictionary, RLE, etc.) and compression settings are applied automagically, or if that's something we need to enable explicitly.
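
For reference, this is the knob I would expect to control that, assuming the Arrow writer accepts the parquet crate's existing `WriterProperties` (whether it is honoured automatically is exactly the open question above):

```rust
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// Build writer properties that explicitly request page compression and
// dictionary encoding; these would be passed to the writer instead of defaults.
fn writer_props() -> WriterProperties {
    WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .set_dictionary_enabled(true)
        .build()
}
```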

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Max Burke <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This will allow preserving Arrow-specific metadata when writing or reading Parquet files created from C++ or Rust.
If the schema can't be deserialised, the normal Parquet > Arrow schema conversion is performed.
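
The fallback reads roughly like this (a conceptual sketch only; the helper names below are placeholders I made up, not the functions added in this commit):

```rust
use arrow::datatypes::Schema;

// Prefer the Arrow schema serialised into the Parquet file metadata; if it is
// missing or can't be decoded, fall back to the normal Parquet -> Arrow conversion.
fn resolve_schema(
    serialized_arrow_schema: Option<&[u8]>,
    fallback_from_parquet: impl Fn() -> Schema,
) -> Schema {
    serialized_arrow_schema
        .and_then(try_decode_ipc_schema)
        .unwrap_or_else(fallback_from_parquet)
}

// Placeholder for "deserialise the stored IPC-encoded schema"; returns None on failure.
fn try_decode_ipc_schema(_bytes: &[u8]) -> Option<Schema> {
    None
}
```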

Closes #7917 from nevi-me/ARROW-8243

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
…arrow_schema with ipc changes

Note that this PR is deliberately filed against the rust-parquet-arrow-writer branch, not master!!

Hi! 👋 I'm looking to help out with the rust-parquet-arrow-writer branch, and I just pulled it down and it wasn't compiling because in 75f804e, `schema_to_bytes` was changed to take `IpcWriteOptions` and to return `EncodedData`. This updates `encode_arrow_schema` to use those changes, which should get this branch compiling and passing tests again.
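
For context, the new call shape as I understand it from that change (a rough sketch only: the module path, the by-reference options argument, and the `ipc_message` field are my assumptions, not verified against 75f804e):

```rust
use arrow::datatypes::Schema;
use arrow::ipc::writer::{schema_to_bytes, IpcWriteOptions};

// Encode an Arrow schema as IPC bytes; schema_to_bytes now takes write options
// and returns an EncodedData struct instead of the raw bytes.
fn encode_schema(schema: &Schema) -> Vec<u8> {
    let encoded = schema_to_bytes(schema, &IpcWriteOptions::default());
    encoded.ipc_message
}
```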

I'm kind of guessing which JIRA ticket this should be associated with; honestly I think this commit can just be squashed with 8f0ed91 next time this branch gets rebased.

Please let me know if I should change anything, I'm happy to!

Closes #8274 from carols10cents/update-with-ipc-changes

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and added a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
…r all supported Arrow DataTypes

Note that this PR goes to the rust-parquet-arrow-writer branch, not master.

Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc

These perform a round trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values for each of the supported data types, and for some of the unsupported ones.

Tests that currently fail are either marked with `#[should_panic]` (if the
reason they fail is because of a panic) or `#[ignore]` (if the reason
they fail is because the values don't match).

I am comparing the RecordBatch's column data before and after the round trip directly; I'm not sure whether this is appropriate, because for some data types the `null_bitmap` doesn't match and I'm not sure whether it's supposed to.

So I would love advice on that front, and I would love to know if these tests are useful or not!
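
Each test has roughly this shape (a hedged sketch: the read path is left as a placeholder rather than guessing the exact reader API, and the `.data()` comparison is the direct column-data comparison described above):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{Array, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;

fn roundtrip(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // a single column of a single type, with some nulls
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, true)]));
    let original = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![Some(1), None, Some(3)]))],
    )?;

    // Arrow -> Parquet
    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, None)?;
    writer.write(&original)?;
    writer.close()?;

    // Parquet -> Arrow, then compare the column data (including null bitmaps) directly
    let read_back = read_single_batch(path)?;
    assert_eq!(original.column(0).data(), read_back.column(0).data());
    Ok(())
}

// Placeholder: stands in for the parquet crate's Arrow reader, which the real
// tests use to read the file back into a RecordBatch.
fn read_single_batch(_path: &str) -> Result<RecordBatch, Box<dyn std::error::Error>> {
    unimplemented!("use the parquet Arrow reader here")
}
```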

Closes #8330 from carols10cents/roundtrip-tests

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet; it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function looks very similar to the code I've changed here; is there a possibility for the code to be shared, or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Closes #8388 from nevi-me/ARROW-10225

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This allows writing an Arrow NullArray to Parquet.
Support for null types was added to Parquet a few years ago, and the C++ implementation supports writing null arrays.
The array is stored as an int32 column with all values set to null.
To implement this, we introduce a `null -> int32` cast, which creates an all-null int32 array of the same length.
Semantically, the write is the same as writing an int32 that's all null, but we create a null writer to preserve the data type.
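
A small sketch of that cast (assuming the compute `cast` kernel accepts `Null -> Int32`, which is what this change introduces):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, NullArray};
use arrow::compute::kernels::cast::cast;
use arrow::datatypes::DataType;

// Cast an all-null NullArray to an all-null Int32 array of the same length,
// which is what gets handed to the Parquet int32 column writer.
fn null_as_int32(len: usize) -> ArrayRef {
    let nulls: ArrayRef = Arc::new(NullArray::new(len));
    let ints = cast(&nulls, &DataType::Int32).expect("null -> int32 cast");
    assert_eq!(ints.len(), len);
    assert_eq!(ints.null_count(), len); // every value is null
    ints
}
```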

Closes #8484 from nevi-me/ARROW-10334

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This is a port of #6770 to the parquet-writer branch.

We'll have more of a chance to test this reader, and ensure that we can roundtrip on list types.

Closes #8449 from nevi-me/ARROW-7842-cherry

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This adds more support for:

- When converting Arrow -> Parquet containing an Arrow Dictionary,
materialize the Dictionary values and send them to Parquet to be encoded with
a dictionary or not according to the Parquet settings (we deliberately do not
convert an Arrow Dictionary directly to Parquet DictEncoding, and right now this only supports String dictionaries)
- When converting Parquet -> Arrow, noticing that the Arrow schema
metadata in a Parquet file has a Dictionary type and converting the data
to an Arrow dictionary (right now this only supports String dictionaries)

I'm not sure if this is in a good enough state to merge or not yet, please let me know @nevi-me !

Closes #8402 from carols10cents/dict

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Jake Goulding <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
@nevi-me nevi-me force-pushed the rust-parquet-arrow-writer branch from ce06e17 to 3cc3a4b on October 28, 2020 18:58
@nevi-me nevi-me merged commit 3cc3a4b into master Oct 28, 2020
@nevi-me nevi-me deleted the rust-parquet-arrow-writer branch October 28, 2020 21:26
@jorgecarleitao (Member)

There was probably some misunderstanding during the call yesterday, but I was expecting this to be a pull request that was approved by another committer before being merged. This merge IMO did not follow our review practice and did not give other committers time to go through it.

@jorgecarleitao (Member) left a comment

I am a bit lost here: I thought that we would still conduct a review of this code before a merge, as no one was reviewing while the branch was up.

IMO we introduced errors into the arrow crate, and there was no opportunity for others to review before this merge.

Comment on lines +212 to +266
```rust
impl PartialEq for ArrayData {
    fn eq(&self, other: &Self) -> bool {
        assert_eq!(
            self.data_type(),
            other.data_type(),
            "Data types not the same"
        );
        assert_eq!(self.len(), other.len(), "Lengths not the same");
        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
        assert_eq!(self.null_count(), other.null_count());
        // compare buffers excluding padding
        let self_buffers = self.buffers();
        let other_buffers = other.buffers();
        assert_eq!(self_buffers.len(), other_buffers.len());
        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
            compare_buffer_regions(
                s,
                self.offset(), // TODO mul by data length
                o,
                other.offset(), // TODO mul by data len
            );
        });
        // assert_eq!(self.buffers(), other.buffers());

        assert_eq!(self.child_data(), other.child_data());
        // null arrays can skip the null bitmap, thus only compare if there are no nulls
        if self.null_count() != 0 || other.null_count() != 0 {
            compare_buffer_regions(
                self.null_buffer().unwrap(),
                self.offset(),
                other.null_buffer().unwrap(),
                other.offset(),
            )
        }
        true
    }
}

/// A helper to compare buffer regions of 2 buffers.
/// Compares the length of the shorter buffer.
fn compare_buffer_regions(
    left: &Buffer,
    left_offset: usize,
    right: &Buffer,
    right_offset: usize,
) {
    // for convenience, we assume that the buffer lengths are only unequal if one has padding,
    // so we take the shorter length so we can discard the padding from the longer length
    let shorter_len = left.len().min(right.len());
    let s_sliced = left.bit_slice(left_offset, shorter_len);
    let o_sliced = right.bit_slice(right_offset, shorter_len);
    assert_eq!(s_sliced, o_sliced);
}
```

@jorgecarleitao (Member):
IMO this is not correct, see this comment

@nevi-me (Contributor, Author) commented Oct 29, 2020

> There was probably some misunderstanding during the call yesterday, but I was expecting this to be a pull request that was approved by another committer before being merged. This merge IMO did not follow our review practice and did not give other committers time to go through it.

Apologies Jorge,

The intention was to bring the changes into master, on the rebuttable presumption that they had already been reviewed.
We don't have a lot of review bandwidth, so I wasn't expecting anyone to review the commits again, as we also wouldn't easily change them (except maybe the last commit, with a `git commit --amend`).

After creating the PR, I noticed that the other merge options have been disabled in the UI (we can only squash now), so the aim of using the UI to merge this was defeated, and I rebased and pushed from my end.

I've gone through your questions around `ArrayData` equality, so I'll have a look at what you've done on #8541, and I can prioritise any follow-up work needed to cover what might still be outstanding.

@jorgecarleitao
Copy link
Member

> There was probably some misunderstanding during the call yesterday, but I was expecting this to be a pull request that was approved by another committer before being merged. This merge IMO did not follow our review practice and did not give other committers time to go through it.
>
> Apologies Jorge,
>
> The intention was to bring the changes into master, on the rebuttable presumption that they had already been reviewed.
> We don't have a lot of review bandwidth, so I wasn't expecting anyone to review the commits again, as we also wouldn't easily change them (except maybe the last commit, with a `git commit --amend`).
>
> After creating the PR, I noticed that the other merge options have been disabled in the UI (we can only squash now), so the aim of using the UI to merge this was defeated, and I rebased and pushed from my end.
>
> I've gone through your questions around `ArrayData` equality, so I'll have a look at what you've done on #8541, and I can prioritise any follow-up work needed to cover what might still be outstanding.

No worries, it was a miscommunication, and I am sorry for not being very explicit about it: I was aware of the writer branch, but since it was moving fast, I treated it as "WIP" and waited for a PR to master before going through it. Once I saw the PR from @carols10cents to master, I understood it to be the "ready to review" version of the branch, and reviewed it ASAP, given the clock ticking on the parquet work and the call.

Only afterwards did I realize that that PR was actually a different thing, and that the main changes had already been merged.

I would consider it beneficial to have a review on any push to master, from anyone. If anything, it gives others time to comment on the PR and get accustomed to the changes.

@nevi-me (Contributor, Author) commented Oct 29, 2020

That's okay, I understand.
I suppose it's becoming less of an issue as we seem to be gaining committers and reviewers. We'll also be doing the rest of the parquet work on the main branch, so this won't happen again.

@jorgecarleitao (Member)

Anyway, I do not want this to downplay any of the work that mostly you and @carols10cents have been doing: the changes look really great. Thanks a lot for taking this on. It is a monumental effort and the outcome is great. 💯 Overall, this brings the Rust implementation to its full potential of reading from and writing to Parquet.
