Skip to content

Conversation

@mcassels
Copy link

@mcassels mcassels commented Mar 30, 2020

Nested lists are not yet supported. PR for new test file for datafusion test: apache/parquet-testing#11

@github-actions
Copy link

@mcassels
Copy link
Author

@andygrove @nevi-me do you have any thoughts on this?

@andygrove andygrove changed the title ARROW-7842 [Parquet][Rust] implement array_reader for list type columns ARROW-7842: [Rust] [Parquet] implement array_reader for list type columns Apr 22, 2020
@andygrove andygrove requested review from nevi-me and paddyhoran April 22, 2020 23:31
\n TableScan: employee.csv projection=Some([0, 3])";
\n Selection: #state Eq Utf8(\"CO\")\
\n TableScan: employee.csv projection=Some([0, 3])";

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these formatting changes from running rustfmt?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

let list = $column
.as_any()
.downcast_ref::<array::ListArray>()
.unwrap()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expect would be better than unwrap. Using a Result would be better than expect.

Ok(values) => values,
_ => {
return Err(ExecutionError::ExecutionError(format!(
"Unsupported {:?} type for repl.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does repl mean here? Maybe this error could be a bit more descriptive?

DataType::Time64(unit) if *unit == TimeUnit::Nanosecond => {
make_string!(array::Time64NanosecondArray, column, row)
}
DataType::List(_) => make_string_from_list!(column, row),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter what data type the list is? Can lists of any type be converted into strings?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recursively calls array_value_to_string on the array items, so an unsupported type should give the appropriate error on the recursive call

@andygrove andygrove requested a review from sunchao April 22, 2020 23:38
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review. I wasn't aware of this PR until you mentioned me. My email filters didn't match the title ... this looks great overall. I left a few comments and would like to see some feedbackfrom someone more familiar with the parquet crate. I've requested some other reviewers take a look. Thanks for the contribution!

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

ArrowType::Time64(ArrowTimeUnit::Nanosecond) => {
build_empty_list_array_with_primitive_items!(ArrowTime64NanosecondType)
}
ArrowType::Duration(ArrowTimeUnit::Second) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are durations supported in parquet?

Copy link
Author

@mcassels mcassels May 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the PrimitiveArrayReader implementation supports Duration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably incorrect, but we can clarify/fix the behaviour of the readers at a later stage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably incorrect, but we can clarify/fix the behaviour of the readers at a later stage.

@maxburke
Copy link
Contributor

maxburke commented May 5, 2020

Is there anything else that needs to be taken care of in order to have this merged?

@nevi-me
Copy link
Contributor

nevi-me commented May 5, 2020

Is there anything else that needs to be taken care of in order to have this merged?

We need to wait for apache/parquet-testing#11 to be merged, then update the submodule so that the tests pass.

@sunchao @paddyhoran may you please also have a look at this when you get a chance. I have some slight concerns around Arrow <-> Parquet format support, but I'm comfortable that we'll be able to pick up any required changes when we progress with the Arrow to Parquet writer.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry again for the late review. Have being out of touch for some time and is taking time to get back to the latest state of Arrow reader in parquet-rs. Left some comments but will probably add more later.


/// Build array reader for list type.
/// Currently this is not supported.
fn visit_list_with_item(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being called anywhere?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}};
}

fn build_empty_list_array(item_type: ArrowType) -> Result<ArrayRef> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this is implemented isn't ideal as it has to do pattern matching on every possible type. Is there a way to simplify this? For instance, every ArrowPrimitiveType already has a DataType as its field.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether there's a way to do the association in the opposite direction. In this case we want to go from a DataType to an ArrowPrimitiveType. I found that pattern matching is currently used in other similar situations like https://github.com/apache/arrow/blob/master/rust/arrow/src/json/reader.rs#L433 for example

&self.data_type
}

fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this handle nested lists?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this PR doesn't handle nested lists. At the end of visit_list_with_item there's a check on the item type and it gives an error if the List item type is List https://github.com/apache/arrow/pull/6770/files#diff-fca053527b552a9f2abb1e280e570cb6R1197

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It will be good if we have a path to get there. Can also reference the C++ impl. Also could you add some comments in this to indicate that nested lists are not supported?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added a couple comments -- on the ArrayReader implementation and on visit_list_with_item

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcassels thanks for your patience and I added more comments on this PR.

&& (rep_levels.len() == next_batch_array.len()))
{
return Err(ArrowError(
"Expected item_reader def_level and rep_level arrays to have the same length as batch array".to_string(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: long line

if next_batch_array.len() == 0 {
return build_empty_list_array(item_type);
}
let def_levels = self.item_reader.get_def_levels().unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

propagate possible errors instead of unwrap?

let mut new_context = context.clone();

let list_child = &list_type.get_fields()[0];
let item_child = &list_child.get_fields()[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look correct. I think we should use the _item_type passed in for item type. List type in Parquet can be represented with several different kinds of schema as mentioned here. This method already helped to setup the list item type for you.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes more sense. I'm not sure why I did it that way before

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failing test made me realize why I had it that way. Currently item_type is passed as &Type to visit_list_with_item while list_type is passed as Rc<Type>. Because item_type is not passed as Rc<Type>, if item_type is used then when the list items are visited by PrimitiveArrayReader, they will not satisfy the self.is_included check, because that checks using the reference.

I've now changed the signature of visit_list_with_item to pass item_type as Rc<Type> to fix this issue.

&self.data_type
}

fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It will be good if we have a path to get there. Can also reference the C++ impl. Also could you add some comments in this to indicate that nested lists are not supported?


// null list has def_level = 0
// empty list has def_level = 1
// null item in a list has def_level = 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to check if the list item is required or optional right? the list itself can also be either optional or requried.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it seems that this comment is not correct... I think it would be correct just in the case where both the list and the list item are optional.

However, much of the logic in this comment is not actually used: Rep and def levels are compared only with 0. Do you see any flaws in the def and rep level logic that comes after this comment? If the actual logic is ok I'll just remove the comment.

}
let batch_values = match null_list_indices.len() {
0 => next_batch_array.clone(),
_ => remove_indices(next_batch_array.clone(), item_type, null_list_indices)?,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why you need to process the batch array and add them through a builder? why cannot just directly use the array in the result ListArray and set up validity bits and offsets only?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, the null values that represent a null list rather than a null list item need to not be present in the child data of the final ListArray. I don't like that this uses the builder either, but I wasn't sure how else to create that child data array that has specific indices excluded... I might be misunderstanding something though. Do you have any suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I understand you correctly. I think a null list is represented with the combination of valid bit array and offset array, and should have no presence in the value array (example here). Therefore, the input array can just be passed intact to the ListArray as value array.

It would also be useful to refer the C++ impl.

@emkornfield
Copy link
Contributor

@nevi-me did your other PRs supercede this?

@nevi-me
Copy link
Contributor

nevi-me commented Sep 16, 2020

I'll pick this up in the coming 2 weeks. @maxburke @mcassels I might need write access to the urbanlogiq fork, but I can ping you privately and ask for access temporarily until we get this merged.

@nevi-me
Copy link
Contributor

nevi-me commented Sep 26, 2020

This has sadly fallen behind significantly from the main branch. I've spent a few hours trying to rebase, but because it looks like the PR contained changes for the below, it's a bit difficult to follow.

  • filter kernel for lists
  • contains kernel (also can't tell if it was introduced then removed in subsequent commits, but looks like that)
  • the parquet list reader

I got as far as getting everything except the filter kernel working, but I suspect I'll be spending a few more hours dealing with merge conflicts; so I'm throwing in the towel for now. I'll try a different strategy of reimplementing the parquet changes on the main branch separately.

@yordan-pavlov would you have capacity to help with filter kernels for lists? We would use @mcassels work, but convert it to use the FilterContext that you introduced (IIRC).

@nevi-me
Copy link
Contributor

nevi-me commented Sep 26, 2020

@nevi-me did your other PRs supercede this?

I hadn't seen this @emkornfield. No, I've only been focusing on the writer, so I'll still want to merge this into the branch currently called master.
I've also noticed that we could split the PR into 2 or 3, so I'll try do that as I've spent a few hours trying to rebase; and I'll have to throw away that time :(

@yordan-pavlov
Copy link
Contributor

@nevi-me I am happy to help but I couldn't see any changes to the filter kernel in this PR; have I misunderstood your message?

@nevi-me
Copy link
Contributor

nevi-me commented Sep 26, 2020

@nevi-me I am happy to help but I couldn't see any changes to the filter kernel in this PR; have I misunderstood your message?

Apologies, this PR came from a fork that UrbanLogiq maintain, so I think they had worked on list filters for their purposes, and then initially included those commits in this PR, but eventually reversed them.

I think it's still good to incorporate those changes, as the relevant JIRA has been open for a while (ARROW-5350).

I've extracted the changes and pushed them to my fork at nevi-me@4c11f46. You'll see that I commented out the filter on lists, as I couldn't figure out how to make them work with FilterContext as mentioned.

There's already some tests, so looks like a solid base to start from. You can copy the changes from my commit, or push into my ARROW-5350 branch; either's fine with me.

@yordan-pavlov
Copy link
Contributor

thanks @nevi-me , I understand now, will have a look this week

@mcassels
Copy link
Author

@nevi-me I am happy to help but I couldn't see any changes to the filter kernel in this PR; have I misunderstood your message?

Apologies, this PR came from a fork that UrbanLogiq maintain, so I think they had worked on list filters for their purposes, and then initially included those commits in this PR, but eventually reversed them.

I think it's still good to incorporate those changes, as the relevant JIRA has been open for a while (ARROW-5350).

I've extracted the changes and pushed them to my fork at nevi-me@4c11f46. You'll see that I commented out the filter on lists, as I couldn't figure out how to make them work with FilterContext as mentioned.

There's already some tests, so looks like a solid base to start from. You can copy the changes from my commit, or push into my ARROW-5350 branch; either's fine with me.

Yes we had worked on list filters as well but intended to make that a separate PR. But as you can see unfortunately we got busy and fell behind main :/

@velvia
Copy link
Contributor

velvia commented Sep 29, 2020

@nevi-me @yordan-pavlov Hi so I joined UrbanLogiq recently and am helping to maintain this stuff. Turns out this has become important to fix and merge for us too and we need to bring our branch up to date. Let us know best way to go forward, but one of us is happy to start porting this change to a newer updated branch and let everyone have a look, and push a formal Arrow PR as well.

@nevi-me
Copy link
Contributor

nevi-me commented Sep 30, 2020

@nevi-me @yordan-pavlov Hi so I joined UrbanLogiq recently and am helping to maintain this stuff. Turns out this has become important to fix and merge for us too and we need to bring our branch up to date. Let us know best way to go forward, but one of us is happy to start porting this change to a newer updated branch and let everyone have a look, and push a formal Arrow PR as well.

I've already ported the changes to a separate branch, and I'm currently going through @sunchao's review. I ran out of time over the weekend, but I might be able to complete the changes in the coming week.

I know it creates a dependency on me, but given the hours I've already spent on this; I'd rather still continue and address the review, as I might be able to merge the new PR immediately if I don't have questions for anyone.

@velvia
Copy link
Contributor

velvia commented Sep 30, 2020

I know it creates a dependency on me, but given the hours I've already spent on this; I'd rather still continue and address the review, as I might be able to merge the new PR immediately if I don't have questions for anyone.

@nevi-me that sounds great, looking forward to it! There is another feature we might look to push from our branch sometime. :)

@nevi-me
Copy link
Contributor

nevi-me commented Oct 12, 2020

Closed in favour of #8449

@nevi-me nevi-me closed this Oct 12, 2020
nevi-me added a commit to nevi-me/arrow that referenced this pull request Oct 16, 2020
This is a port of apache#6770 to the parquet-writer branch.
We'll have more of a chance to test this reader,
and ensure that we can roundtrip on list types.
nevi-me added a commit to nevi-me/arrow that referenced this pull request Oct 17, 2020
This is a port of apache#6770 to the parquet-writer branch.
We'll have more of a chance to test this reader,
and ensure that we can roundtrip on list types.

[Rust] [Parquet] LargeListArray support and why I think the tests are still failing (#6)

* Support reading LargeListArrays by making ListArrayReader generic over OffsetSize

* Update comment to match the actual values in this test; probably copy-paste

* Document why I think the test setup isn't quite right

disable list writer tests

They're failing because of incorrect def/rep.
Will be addressed separately
nevi-me added a commit that referenced this pull request Oct 17, 2020
This is a port of #6770 to the parquet-writer branch.

We'll have more of a chance to test this reader,and ensure that we can roundtrip on list types.

Closes #8449 from nevi-me/ARROW-7842-cherry

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Oct 25, 2020
This is a port of #6770 to the parquet-writer branch.

We'll have more of a chance to test this reader,and ensure that we can roundtrip on list types.

Closes #8449 from nevi-me/ARROW-7842-cherry

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Oct 27, 2020
This is a port of #6770 to the parquet-writer branch.

We'll have more of a chance to test this reader,and ensure that we can roundtrip on list types.

Closes #8449 from nevi-me/ARROW-7842-cherry

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me added a commit that referenced this pull request Oct 28, 2020
This is a port of #6770 to the parquet-writer branch.

We'll have more of a chance to test this reader,and ensure that we can roundtrip on list types.

Closes #8449 from nevi-me/ARROW-7842-cherry

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants