-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Feat: Support array flatten() on List(LargeList(_)) types
#18363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently is not a finished solution.
Still need to figure out how to create a testcase where offsets would not be able to downcast to i32.
Should it be a sqllogictest test using an array with i32::MAX + 1 values or something else?
| List(field) | LargeList(field) | FixedSizeList(field, _) => { | ||
| List(Arc::clone(field)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this only supports arrays that can be converted from LargeList to List.
| let (inner_field, inner_offsets, inner_values, _) = | ||
| as_large_list_array(&values)?.clone().into_parts(); | ||
| // Try to downcast the inner offsets to i32 | ||
| match downcast_i64_inner_to_i32(&inner_offsets, &offsets) { | ||
| Ok(i32offsets) => { | ||
| let flattened_array = GenericListArray::<i32>::new( | ||
| inner_field, | ||
| i32offsets, | ||
| inner_values, | ||
| nulls, | ||
| ); | ||
| Ok(Arc::new(flattened_array) as ArrayRef) | ||
| } | ||
| // If downcast fails we keep the offsets as is | ||
| Err(_) => { | ||
| // Fallback: keep i64 offsets → LargeList<i64> | ||
| let i64offsets = keep_offsets_i64(inner_offsets, offsets); | ||
| let flattened_array = GenericListArray::<i64>::new( | ||
| inner_field, | ||
| i64offsets, | ||
| inner_values, | ||
| nulls, | ||
| ); | ||
| Ok(Arc::new(flattened_array) as ArrayRef) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are trying to downcast the inner array offsets from i64 to i32 and if fail we fallback to i64 offsets.
The fallback is not yet supported by the return_type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make sense to me as it means the return value of the function might not conform to the stated return_type
If return type says the return array will be ListArray for a LargeListArray input, then the invoke function must return that value (or an error)
SO I think we need to either update the implementation and return typetype to return LargeListArray for LargeListArray inputs or else return an error if we can't cast the offsets correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at @Jefffrey's original comment suggesting that we should try to downcast it i32 returning a ListArray and alternatively return a LargeListArray if downcast is not possible.
He also mentioned that upcasting the parent offsets to LargeListArray blindly would be inefficient and undesirable.
I wanted to see if it's possible to infer the possibility of downcasting the inner array offsets within the return_type function. Then return the matching output type.
We could do it by checking if: inner_offsets.last() <= i32::MAX
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep I did mention that, however I did make a note that:
though this might be tricky considering
return_type()wouldn't know this until execution
I think we make a choice, either always upcast to large if one of the inner lists is a large list, or just error.
I guess the former is more favourable/robust than the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, if the consensus is to always upcast to LargeListArray I'll proceed with it.
| } | ||
| LargeList(_) => { | ||
| let (inner_field, inner_offsets, inner_values, nulls) = | ||
| let (inner_field, inner_offsets, inner_values, _) = // _ instead of nulls? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here flattened_array was generated using the inner nulls instead of the outer ones. I figured it should follow the same format and use nulls from the outer array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting; so this might be a bug in current behaviour? Do you think you can mock up a test where it fails on main without this fix?
| inner_offsets: OffsetBuffer<O>, | ||
| outer_offsets: OffsetBuffer<O>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little naming change because the previous one was confusing.
| // Function for converting LargeList offsets into List offsets | ||
| fn downcast_i64_inner_to_i32( | ||
| inner_offsets: &OffsetBuffer<i64>, | ||
| outer_offsets: &OffsetBuffer<i32>, | ||
| ) -> Result<OffsetBuffer<i32>, ArrowError> { | ||
| let buffer = inner_offsets.clone().into_inner(); | ||
| let offsets: Result<Vec<i32>, _> = outer_offsets | ||
| .iter() | ||
| .map(|i| buffer[i.to_usize().unwrap()]) | ||
| .map(|i| { | ||
| i32::try_from(i) | ||
| .map_err(|_| ArrowError::CastError(format!("Cannot downcast offset {i}"))) | ||
| }) | ||
| .collect(); | ||
| Ok(OffsetBuffer::new(offsets?.into())) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is trying to downcast offsets to i32.
- On success it returns an
OffsetBuffer<i32> - If it fails it errors out.
| } | ||
|
|
||
| // In case the conversion fails we convert the outer offsets into i64 | ||
| fn keep_offsets_i64( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function takes the outer i32 and inner i64 offsets and keeps the inner i64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function should be renamed (and docstring adjusted) now? keep_offsets_i64 is a bit confusing to read for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for this @sdf-jkl
| let (inner_field, inner_offsets, inner_values, _) = | ||
| as_large_list_array(&values)?.clone().into_parts(); | ||
| // Try to downcast the inner offsets to i32 | ||
| match downcast_i64_inner_to_i32(&inner_offsets, &offsets) { | ||
| Ok(i32offsets) => { | ||
| let flattened_array = GenericListArray::<i32>::new( | ||
| inner_field, | ||
| i32offsets, | ||
| inner_values, | ||
| nulls, | ||
| ); | ||
| Ok(Arc::new(flattened_array) as ArrayRef) | ||
| } | ||
| // If downcast fails we keep the offsets as is | ||
| Err(_) => { | ||
| // Fallback: keep i64 offsets → LargeList<i64> | ||
| let i64offsets = keep_offsets_i64(inner_offsets, offsets); | ||
| let flattened_array = GenericListArray::<i64>::new( | ||
| inner_field, | ||
| i64offsets, | ||
| inner_values, | ||
| nulls, | ||
| ); | ||
| Ok(Arc::new(flattened_array) as ArrayRef) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make sense to me as it means the return value of the function might not conform to the stated return_type
If return type says the return array will be ListArray for a LargeListArray input, then the invoke function must return that value (or an error)
SO I think we need to either update the implementation and return typetype to return LargeListArray for LargeListArray inputs or else return an error if we can't cast the offsets correctly
| } | ||
|
|
||
| // In case the conversion fails we convert the outer offsets into i64 | ||
| fn keep_offsets_i64( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function should be renamed (and docstring adjusted) now? keep_offsets_i64 is a bit confusing to read for me
| select flatten(arrow_cast(make_array([1], [2, 3], [null], make_array(4, null, 5)), 'FixedSizeList(4, LargeList(Int64))')), | ||
| flatten(arrow_cast(make_array([[1.1], [2.2]], [[3.3], [4.4]]), 'List(LargeList(FixedSizeList(1, Float64)))')); | ||
| ---- | ||
| [1, 2, 3, NULL, 4, NULL, 5] [[1.1], [2.2], [3.3], [4.4]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think should also add a check for the output type via arrow_typeof() to ensure we are indeed getting a large list
| } | ||
| LargeList(_) => { | ||
| let (inner_field, inner_offsets, inner_values, nulls) = | ||
| let (inner_field, inner_offsets, inner_values, _) = // _ instead of nulls? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting; so this might be a bug in current behaviour? Do you think you can mock up a test where it fails on main without this fix?
|
Thanks for your review @Jefffrey, I've addressed the issues you pointed out. I'll check the potential bug in |
|
I made a test that points out the bug on a different branch, should I create a separate issue and a PR for it? |
|
Looks like some tests are failing in CI |
|
@Jefffrey Fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Which issue does this PR close?
flatten()onList(LargeList(_))types #17670Rationale for this change
What changes are included in this PR?
Added a
flatten()List(LargeList)test to thesqllogictestAdded support for array
flatten()onList(LargeList(_))typesAre these changes tested?
sqllogictestpasses, but I still need to implement a test where offsets could not be downcasted from i64 to i32Are there any user-facing changes?
Users will be able to use
flattenonList(LargeList)types