-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-7842: [Rust] [Parquet] implement array_reader for list type columns #6770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dc2c861 to
8c8efa1
Compare
|
@andygrove @nevi-me do you have any thoughts on this? |
| \n TableScan: employee.csv projection=Some([0, 3])"; | ||
| \n Selection: #state Eq Utf8(\"CO\")\ | ||
| \n TableScan: employee.csv projection=Some([0, 3])"; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these formatting changes from running rustfmt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
rust/datafusion/src/utils.rs
Outdated
| let list = $column | ||
| .as_any() | ||
| .downcast_ref::<array::ListArray>() | ||
| .unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expect would be better than unwrap. Using a Result would be better than expect.
rust/datafusion/src/utils.rs
Outdated
| Ok(values) => values, | ||
| _ => { | ||
| return Err(ExecutionError::ExecutionError(format!( | ||
| "Unsupported {:?} type for repl.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does repl mean here? Maybe this error could be a bit more descriptive?
| DataType::Time64(unit) if *unit == TimeUnit::Nanosecond => { | ||
| make_string!(array::Time64NanosecondArray, column, row) | ||
| } | ||
| DataType::List(_) => make_string_from_list!(column, row), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it matter what data type the list is? Can lists of any type be converted into strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This recursively calls array_value_to_string on the array items, so an unsupported type should give the appropriate error on the recursive call
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review. I wasn't aware of this PR until you mentioned me. My email filters didn't match the title ... this looks great overall. I left a few comments and would like to see some feedbackfrom someone more familiar with the parquet crate. I've requested some other reviewers take a look. Thanks for the contribution!
nevi-me
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| ArrowType::Time64(ArrowTimeUnit::Nanosecond) => { | ||
| build_empty_list_array_with_primitive_items!(ArrowTime64NanosecondType) | ||
| } | ||
| ArrowType::Duration(ArrowTimeUnit::Second) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are durations supported in parquet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the PrimitiveArrayReader implementation supports Duration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's probably incorrect, but we can clarify/fix the behaviour of the readers at a later stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's probably incorrect, but we can clarify/fix the behaviour of the readers at a later stage.
|
Is there anything else that needs to be taken care of in order to have this merged? |
We need to wait for apache/parquet-testing#11 to be merged, then update the submodule so that the tests pass. @sunchao @paddyhoran may you please also have a look at this when you get a chance. I have some slight concerns around Arrow <-> Parquet format support, but I'm comfortable that we'll be able to pick up any required changes when we progress with the Arrow to Parquet writer. |
sunchao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry again for the late review. Have being out of touch for some time and is taking time to get back to the latest state of Arrow reader in parquet-rs. Left some comments but will probably add more later.
|
|
||
| /// Build array reader for list type. | ||
| /// Currently this is not supported. | ||
| fn visit_list_with_item( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this being called anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| }}; | ||
| } | ||
|
|
||
| fn build_empty_list_array(item_type: ArrowType) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this is implemented isn't ideal as it has to do pattern matching on every possible type. Is there a way to simplify this? For instance, every ArrowPrimitiveType already has a DataType as its field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether there's a way to do the association in the opposite direction. In this case we want to go from a DataType to an ArrowPrimitiveType. I found that pattern matching is currently used in other similar situations like https://github.com/apache/arrow/blob/master/rust/arrow/src/json/reader.rs#L433 for example
| &self.data_type | ||
| } | ||
|
|
||
| fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this handle nested lists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this PR doesn't handle nested lists. At the end of visit_list_with_item there's a check on the item type and it gives an error if the List item type is List https://github.com/apache/arrow/pull/6770/files#diff-fca053527b552a9f2abb1e280e570cb6R1197
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. It will be good if we have a path to get there. Can also reference the C++ impl. Also could you add some comments in this to indicate that nested lists are not supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just added a couple comments -- on the ArrayReader implementation and on visit_list_with_item
sunchao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mcassels thanks for your patience and I added more comments on this PR.
| && (rep_levels.len() == next_batch_array.len())) | ||
| { | ||
| return Err(ArrowError( | ||
| "Expected item_reader def_level and rep_level arrays to have the same length as batch array".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: long line
| if next_batch_array.len() == 0 { | ||
| return build_empty_list_array(item_type); | ||
| } | ||
| let def_levels = self.item_reader.get_def_levels().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
propagate possible errors instead of unwrap?
| let mut new_context = context.clone(); | ||
|
|
||
| let list_child = &list_type.get_fields()[0]; | ||
| let item_child = &list_child.get_fields()[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look correct. I think we should use the _item_type passed in for item type. List type in Parquet can be represented with several different kinds of schema as mentioned here. This method already helped to setup the list item type for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes more sense. I'm not sure why I did it that way before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failing test made me realize why I had it that way. Currently item_type is passed as &Type to visit_list_with_item while list_type is passed as Rc<Type>. Because item_type is not passed as Rc<Type>, if item_type is used then when the list items are visited by PrimitiveArrayReader, they will not satisfy the self.is_included check, because that checks using the reference.
I've now changed the signature of visit_list_with_item to pass item_type as Rc<Type> to fix this issue.
| &self.data_type | ||
| } | ||
|
|
||
| fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. It will be good if we have a path to get there. Can also reference the C++ impl. Also could you add some comments in this to indicate that nested lists are not supported?
|
|
||
| // null list has def_level = 0 | ||
| // empty list has def_level = 1 | ||
| // null item in a list has def_level = 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to check if the list item is required or optional right? the list itself can also be either optional or requried.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it seems that this comment is not correct... I think it would be correct just in the case where both the list and the list item are optional.
However, much of the logic in this comment is not actually used: Rep and def levels are compared only with 0. Do you see any flaws in the def and rep level logic that comes after this comment? If the actual logic is ok I'll just remove the comment.
| } | ||
| let batch_values = match null_list_indices.len() { | ||
| 0 => next_batch_array.clone(), | ||
| _ => remove_indices(next_batch_array.clone(), item_type, null_list_indices)?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why you need to process the batch array and add them through a builder? why cannot just directly use the array in the result ListArray and set up validity bits and offsets only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, the null values that represent a null list rather than a null list item need to not be present in the child data of the final ListArray. I don't like that this uses the builder either, but I wasn't sure how else to create that child data array that has specific indices excluded... I might be misunderstanding something though. Do you have any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I understand you correctly. I think a null list is represented with the combination of valid bit array and offset array, and should have no presence in the value array (example here). Therefore, the input array can just be passed intact to the ListArray as value array.
It would also be useful to refer the C++ impl.
|
@nevi-me did your other PRs supercede this? |
|
This has sadly fallen behind significantly from the main branch. I've spent a few hours trying to rebase, but because it looks like the PR contained changes for the below, it's a bit difficult to follow.
I got as far as getting everything except the filter kernel working, but I suspect I'll be spending a few more hours dealing with merge conflicts; so I'm throwing in the towel for now. I'll try a different strategy of reimplementing the parquet changes on the main branch separately. @yordan-pavlov would you have capacity to help with filter kernels for lists? We would use @mcassels work, but convert it to use the |
I hadn't seen this @emkornfield. No, I've only been focusing on the writer, so I'll still want to merge this into the branch currently called master. |
|
@nevi-me I am happy to help but I couldn't see any changes to the filter kernel in this PR; have I misunderstood your message? |
Apologies, this PR came from a fork that UrbanLogiq maintain, so I think they had worked on list filters for their purposes, and then initially included those commits in this PR, but eventually reversed them. I think it's still good to incorporate those changes, as the relevant JIRA has been open for a while (ARROW-5350). I've extracted the changes and pushed them to my fork at nevi-me@4c11f46. You'll see that I commented out the filter on lists, as I couldn't figure out how to make them work with There's already some tests, so looks like a solid base to start from. You can copy the changes from my commit, or push into my ARROW-5350 branch; either's fine with me. |
|
thanks @nevi-me , I understand now, will have a look this week |
Yes we had worked on list filters as well but intended to make that a separate PR. But as you can see unfortunately we got busy and fell behind main :/ |
|
@nevi-me @yordan-pavlov Hi so I joined UrbanLogiq recently and am helping to maintain this stuff. Turns out this has become important to fix and merge for us too and we need to bring our branch up to date. Let us know best way to go forward, but one of us is happy to start porting this change to a newer updated branch and let everyone have a look, and push a formal Arrow PR as well. |
I've already ported the changes to a separate branch, and I'm currently going through @sunchao's review. I ran out of time over the weekend, but I might be able to complete the changes in the coming week. I know it creates a dependency on me, but given the hours I've already spent on this; I'd rather still continue and address the review, as I might be able to merge the new PR immediately if I don't have questions for anyone. |
@nevi-me that sounds great, looking forward to it! There is another feature we might look to push from our branch sometime. :) |
|
Closed in favour of #8449 |
This is a port of apache#6770 to the parquet-writer branch. We'll have more of a chance to test this reader, and ensure that we can roundtrip on list types.
This is a port of apache#6770 to the parquet-writer branch. We'll have more of a chance to test this reader, and ensure that we can roundtrip on list types. [Rust] [Parquet] LargeListArray support and why I think the tests are still failing (#6) * Support reading LargeListArrays by making ListArrayReader generic over OffsetSize * Update comment to match the actual values in this test; probably copy-paste * Document why I think the test setup isn't quite right disable list writer tests They're failing because of incorrect def/rep. Will be addressed separately
This is a port of #6770 to the parquet-writer branch. We'll have more of a chance to test this reader,and ensure that we can roundtrip on list types. Closes #8449 from nevi-me/ARROW-7842-cherry Authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
This is a port of #6770 to the parquet-writer branch. We'll have more of a chance to test this reader,and ensure that we can roundtrip on list types. Closes #8449 from nevi-me/ARROW-7842-cherry Authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
This is a port of #6770 to the parquet-writer branch. We'll have more of a chance to test this reader,and ensure that we can roundtrip on list types. Closes #8449 from nevi-me/ARROW-7842-cherry Authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
This is a port of #6770 to the parquet-writer branch. We'll have more of a chance to test this reader,and ensure that we can roundtrip on list types. Closes #8449 from nevi-me/ARROW-7842-cherry Authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
Nested lists are not yet supported. PR for new test file for datafusion test: apache/parquet-testing#11