@jorgecarleitao (Member) commented Feb 5, 2021

This PR fixes a bug in which GenericListArray does not validate the datatype passed to from(ArrayData), which can cause a range of bugs, including undefined behavior when interpreting the offset buffer.

This PR adds this validation, panicking if the DataType does not match.

This PR also fixes casting to and from Lists, which was creating ArrayData that did not conform to the spec.
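
For illustration, here is a minimal sketch of the kind of check described above. It uses simplified stand-in types rather than the real arrow crate: `DataType`, `ArrayData`, `list_child_type`, and `list_values` are all hypothetical names chosen for this example, and the real `GenericListArray` also deals with offset buffers and null bitmaps.

```rust
// Simplified, hypothetical stand-ins for Arrow's `DataType` and `ArrayData`;
// the real arrow crate types carry buffers, lengths, null bitmaps, and more.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    List(Box<DataType>),
    LargeList(Box<DataType>),
}

#[derive(Debug, Clone)]
pub struct ArrayData {
    pub data_type: DataType,
    pub child_data: Vec<ArrayData>,
}

/// Returns the declared item type when `data_type` is the list variant that
/// matches the requested offset width (`large` means 64-bit offsets).
pub fn list_child_type(data_type: &DataType, large: bool) -> Option<&DataType> {
    match (data_type, large) {
        (DataType::List(child), false) | (DataType::LargeList(child), true) => Some(&**child),
        _ => None,
    }
}

/// Returns the values child after checking it against the declared list type.
/// Panics on any mismatch instead of silently misinterpreting the data.
pub fn list_values(data: &ArrayData, large: bool) -> &ArrayData {
    assert_eq!(
        data.child_data.len(),
        1,
        "a list array must have exactly one child array"
    );
    let values = &data.child_data[0];
    match list_child_type(&data.data_type, large) {
        Some(child) => assert_eq!(
            &values.data_type, child,
            "[Large]ListArray's child datatype does not match the List's datatype"
        ),
        None => panic!("[Large]ListArray data must have a List or LargeList datatype"),
    }
    values
}
```

With this shape, data declared as `List(Int32)` but carrying a `Utf8` child fails loudly at construction time rather than later, when the buffers are read.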

@sunchao (Member) left a comment

Thanks @jorgecarleitao for pinging. Left a few comments.

let values = data.child_data()[0].clone();

if let Some(child) = Self::get_type(data.data_type()) {
assert_eq!(values.data_type(), child, "[Large]ListArray's child datatype does not correspond to the List's datatype");
Member

nit: long line? I remember we enforce a limit of 100 characters.


let values = data.child_data()[0].clone();

if let Some(child) = Self::get_type(data.data_type()) {
Member

Can we have a few tests to cover this, including checks on the error message?

Also, I'm not sure assert_eq is a good fit here: IMO assertions should only be used to check internal invariants that developers must uphold and that are not exposed to library users, which does not appear to be the case here. It's just a nit, though, since this pattern is already used in multiple places.

Member Author

I agree with you. However, that requires a larger change, as we would need to move from From to TryFrom, so for now I just want to avoid unsafe code by panicking every time something may go wrong.

Contributor

Here is a mid-way proposal: a31a35a

Basically, use normal Rust error handling, but make the Into implementation call expect on the result.

I actually think using asserts / panics directly (as in this PR) is also fine because:

  1. it is an improvement over the current behavior (crash / undefined) to get useful error messages (even if it is in a panic :( )
  2. the use of ArrayData is, in my mind, also an implementation detail of an Array, so most users of Arrow shouldn't be interacting with this code at all.

Contributor

BTW I also tried using TryFrom directly and, as @jorgecarleitao suspected, there are many kernel implementations that rely on this being infallible.
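
The mid-way proposal discussed in this thread can be sketched roughly as follows. This is not the code from the linked commit: the types are simplified stand-ins, and `try_new`, `InvalidListData`, and `values` are hypothetical names used only for illustration. The idea is a fallible constructor that returns a `Result`, with the existing infallible `From` conversion kept for callers by calling `expect` on it.

```rust
use std::fmt;

// Hypothetical, simplified stand-ins for the arrow crate's types.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    List(Box<DataType>),
}

#[derive(Debug, Clone)]
pub struct ArrayData {
    pub data_type: DataType,
    pub child_data: Vec<ArrayData>,
}

/// Hypothetical error type describing why the data is not a valid list.
#[derive(Debug, PartialEq)]
pub struct InvalidListData(pub String);

impl fmt::Display for InvalidListData {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid list data: {}", self.0)
    }
}

impl std::error::Error for InvalidListData {}

#[derive(Debug)]
pub struct ListArray {
    values: ArrayData,
}

impl ListArray {
    /// Fallible constructor: reports problems as errors instead of panicking.
    pub fn try_new(data: ArrayData) -> Result<Self, InvalidListData> {
        let declared = match &data.data_type {
            DataType::List(child) => (**child).clone(),
            other => {
                return Err(InvalidListData(format!(
                    "expected a List datatype, got {:?}",
                    other
                )))
            }
        };
        let mut children = data.child_data;
        let values = match (children.pop(), children.is_empty()) {
            (Some(values), true) => values,
            _ => return Err(InvalidListData("expected exactly one child array".to_string())),
        };
        if values.data_type != declared {
            return Err(InvalidListData(format!(
                "child datatype {:?} does not match declared {:?}",
                values.data_type, declared
            )));
        }
        Ok(Self { values })
    }

    pub fn values(&self) -> &ArrayData {
        &self.values
    }
}

// Existing callers keep the infallible conversion; the panic now carries
// the error message produced by `try_new`.
impl From<ArrayData> for ListArray {
    fn from(data: ArrayData) -> Self {
        Self::try_new(data).expect("invalid ArrayData for ListArray")
    }
}
```

A named constructor is used here rather than a `TryFrom` impl because the standard library's blanket `TryFrom` implementation for every `Into` conversion would conflict with a hand-written `TryFrom<ArrayData>` while `From<ArrayData>` remains in place.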

false
}

fn prefix() -> &'static str {
Member

maybe we won't need prefix anymore with the new is_large.

@nevi-me (Contributor)

We might still need it; we also use it for formatting in Display.

Member Author

I think that we can drop it, yes. We can merge StringOffset, BinaryOffset and OffsetTrait in a single Trait with this, but I wanted to leave it to another PR.

Contributor

Here is one way to remove prefix that does not go as far as collapsing the traits, as @jorgecarleitao suggests: 8e68e05
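
As a rough sketch of why `prefix` becomes redundant once `is_large` exists: the prefix string can be derived from the boolean wherever it is needed, including in formatting code. The trait below is a simplified, hypothetical stand-in for the offset trait discussed here, and `list_type_name` is an illustrative helper rather than an arrow API.

```rust
/// Hypothetical, simplified stand-in for the offset-size trait.
pub trait OffsetSize {
    /// True for 64-bit offsets (the "Large" variants), false for 32-bit.
    fn is_large() -> bool;
}

impl OffsetSize for i32 {
    fn is_large() -> bool {
        false
    }
}

impl OffsetSize for i64 {
    fn is_large() -> bool {
        true
    }
}

/// Derives the display name from `is_large` instead of a separate
/// `prefix()` method (illustrative helper, not part of arrow).
pub fn list_type_name<O: OffsetSize>() -> String {
    let prefix = if O::is_large() { "Large" } else { "" };
    format!("{}ListArray", prefix)
}
```

Display formatting can call the same helper, so the concern about still needing `prefix` for `Display` is covered without keeping a second trait method.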

@jhorstmann (Contributor)

Nice! I recently ran into the same issue and the assertion about the nested datatypes would have saved me some time debugging. In my test case, an assert_eq was failing but printed both sides exactly the same.

@jorgecarleitao (Member Author)

@jhorstmann, I think that shows another, separate issue: IMO what we currently show in "Debug" should be shown in "Display", and "Debug" should actually show the full structure, i.e. the one created by #[derive(Debug)].
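
To make the distinction concrete, here is a small sketch with a hypothetical `ListColumn` type (not an arrow type). The compact, value-oriented rendering lives in `Display`, while the derived `Debug` exposes every field, so two values that differ only in metadata no longer look identical in a failing `assert_eq!`.

```rust
use std::fmt;

/// Hypothetical type: values plus a piece of metadata that the compact
/// rendering does not show.
#[derive(Debug, PartialEq)]
pub struct ListColumn {
    pub item_nullable: bool,
    pub values: Vec<i32>,
}

// Compact, human-oriented output: only the values.
impl fmt::Display for ListColumn {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?}", self.values)
    }
}

/// Builds two columns that differ only in metadata (illustrative helper).
pub fn example_columns() -> (ListColumn, ListColumn) {
    (
        ListColumn { item_nullable: true, values: vec![1, 2] },
        ListColumn { item_nullable: false, values: vec![1, 2] },
    )
}
```

Because `assert_eq!` prints its operands with `Debug`, a derived `Debug` would make the difference in `item_nullable` visible in the failure message.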

@nevi-me (Contributor) left a comment

No additional comments, other than @sunchao's

@alamb (Contributor) left a comment

@jorgecarleitao do you need help with this PR? I can try to take some of @sunchao's comments if that would help.

@jorgecarleitao (Member Author)

@alamb that would definitely help. If you have the time, I would really appreciate it.

@alamb (Contributor) commented Feb 15, 2021

> @alamb that would definitely help. If you have the time, I would really appreciate.

@jorgecarleitao I will put it on my queue for tomorrow

@alamb (Contributor) left a comment

I think this PR is good to merge as is, since it makes the code better than on master (fewer errors), though it can be further improved.

@sunchao when you get some time I would love your feedback / advice -- should we merge this PR as is? Would you suggest incorporating one or both of the approaches prototyped in #9508 (removing prefix and making a fallible version of ArrayData --> GenericListArray)?

@alamb added the needs-rebase label Feb 18, 2021

@sunchao (Member) left a comment

LGTM2 - I think we can merge this as it is and solve the comments separately. Thanks!

BTW this needs rebase.

@nevi-me (Contributor) commented Feb 26, 2021

@alamb can we merge this when CI is green, then treat #9508 as a separate JIRA ticket? This had fallen quite behind, but I've rebased and fixed failures on it, when I should have done it against #9508 instead.

@nevi-me removed the needs-rebase label Feb 26, 2021

@codecov-io

Codecov Report

Merging #9425 (979a136) into master (5bea624) will increase coverage by 0.07%.
The diff coverage is 85.93%.

@@            Coverage Diff             @@
##           master    #9425      +/-   ##
==========================================
+ Coverage   82.25%   82.33%   +0.07%     
==========================================
  Files         244      245       +1     
  Lines       55685    56270     +585     
==========================================
+ Hits        45806    46330     +524     
- Misses       9879     9940      +61     
| Impacted Files | Coverage Δ |
|---|---|
| rust/arrow/src/alloc/types.rs | 0.00% <0.00%> (ø) |
| rust/datafusion/src/logical_plan/expr.rs | 81.56% <ø> (+0.42%) ⬆️ |
| ...ion-testing/src/bin/arrow-json-integration-test.rs | 0.00% <0.00%> (ø) |
| rust/arrow/src/ipc/writer.rs | 87.23% <50.00%> (-0.59%) ⬇️ |
| rust/arrow/src/array/array_list.rs | 92.95% <70.00%> (-1.10%) ⬇️ |
| ...datafusion/src/physical_plan/string_expressions.rs | 77.00% <82.23%> (+7.37%) ⬆️ |
| ...t/datafusion/src/physical_plan/coalesce_batches.rs | 84.95% <83.33%> (-0.23%) ⬇️ |
| rust/parquet/src/arrow/array_reader.rs | 77.61% <91.30%> (-0.02%) ⬇️ |
| rust/datafusion/src/physical_plan/functions.rs | 85.52% <91.42%> (+11.69%) ⬆️ |
| rust/arrow/src/alloc/mod.rs | 92.68% <92.68%> (ø) |

... and 29 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4da5822...06ef10c.

@alamb (Contributor) commented Feb 26, 2021

@nevi-me good plan!

"generated_dictionary",
// "generated_duplicate_fieldnames",
"generated_interval",
"generated_large_batch",
Contributor

@nevi-me I don't remember seeing this in the original PR -- was this change intended?

Contributor

NM I see #9587 now

@alamb closed this in acd2a47 Feb 26, 2021
alamb added a commit that referenced this pull request Mar 14, 2021
# Background:
Leftover cleanups suggested by @sunchao on #9425

Broken out from #9508

# Rationale:
This function is redundant with `OffsetSize::is_large`

Closes #9690 from alamb/alamb/remove_prefix

Authored-by: Andrew Lamb
Signed-off-by: Andrew Lamb
alamb added a commit that referenced this pull request Mar 15, 2021
…ead of `panic!`

# Background:
Leftover cleanups suggested by @sunchao on #9425

Broken out from #9508

# Rationale:

Don't use panic! directly. However, since the caller of this function still calls `unwrap()`, I am not sure how much of an improvement this change really is. Still, it may eventually set us up for a more `safe` future.

Closes #9691 from alamb/alamb/fallable_list_conversion

Authored-by: Andrew Lamb
Signed-off-by: Andrew Lamb
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
# Background:
Leftover cleanups suggested by @sunchao on apache/arrow#9425

Broken out from apache/arrow#9508

# Rationale:
This function is redundant with `OffsetSize::is_large`

Closes #9690 from alamb/alamb/remove_prefix

Authored-by: Andrew Lamb
Signed-off-by: Andrew Lamb
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…ead of `panic!`

# Background:
Leftover cleanups suggested by @sunchao on apache/arrow#9425

Broken out from apache/arrow#9508

# Rationale:

Don't use panic! directly. However, since the caller of this function still calls `unwrap()`, I am not sure how much of an improvement this change really is. Still, it may eventually set us up for a more `safe` future.

Closes #9691 from alamb/alamb/fallable_list_conversion

Authored-by: Andrew Lamb
Signed-off-by: Andrew Lamb