Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implemented with_field() for FixedSizeListBuilder #5541

Merged

Conversation

istvan-fodor
Copy link
Contributor

@istvan-fodor istvan-fodor commented Mar 21, 2024

Which issue does this PR close?

Closes #1353

Rationale for this change

FixedSizeListBuilder doesn't allow users to specify the inner field of elements and defaults to the following FieldRef: Arc::new(Field::new("item", values_data.data_type().clone(), true))

We would like to keep the default behavior as-is, but also allow users to specify use their own FieldRef.

What changes are included in this PR?

  1. Added with_field to FixedSizeListBuilder, copied over from GenericListBuilder
  2. Added new field: Option<FieldRef> to FixedSizeListBuilder
  3. Implemented field logic patterned after GenericListBuilder in finish() and finish_build()
  4. Changed unsafe build to safe build (see user-facing changes)
  5. Unit tests for 3 scenarios:
    1. with_field used
    2. with_field used with non-nullable field, expect panic
    3. with_field used with wrong type, expect panic

Are there any user-facing changes?

Yes.

The finish_cloned() and finish() methods no longer use the build_unchecked() method, but rely on build(). If the build fails, the thread panics.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 21, 2024
@istvan-fodor istvan-fodor marked this pull request as draft March 21, 2024 20:11
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @istvan-fodor -- this looks like a nice API improvement to me.

@alamb alamb marked this pull request as ready for review March 22, 2024 20:54
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @istvan-fodor -- I think this looks good to me. I had a small test organization suggestion but I don't think it is required to merge this PR

I don't think the CI failures are related to changes in this PR -- I filed #5544 to fix.

Thanks again

cc @TimDiekmann

assert_eq!(
f.data_type(),
values_data.data_type(),
"DataType of field ({}) should be the same as the values_builder DataType ({})",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 310 to 326
// [[0, 1, 2], null, [3, null, 5], [6, 7, null]]
builder.values().append_value(0);
builder.values().append_value(1);
builder.values().append_value(2);
builder.append(true);
builder.values().append_null();
builder.values().append_null();
builder.values().append_null();
builder.append(false);
builder.values().append_value(3);
builder.values().append_null();
builder.values().append_value(5);
builder.append(true);
builder.values().append_value(6);
builder.values().append_value(7);
builder.values().append_null();
builder.append(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to avoid repeating this same setup multiple times in the tests (and it would make it clear the tests share the same data)

for example

fn make_list_builder() -> FixedSizedListBuilder<i32> { 
  ...
}
Suggested change
// [[0, 1, 2], null, [3, null, 5], [6, 7, null]]
builder.values().append_value(0);
builder.values().append_value(1);
builder.values().append_value(2);
builder.append(true);
builder.values().append_null();
builder.values().append_null();
builder.values().append_null();
builder.append(false);
builder.values().append_value(3);
builder.values().append_null();
builder.values().append_value(5);
builder.append(true);
builder.values().append_value(6);
builder.values().append_value(7);
builder.values().append_null();
builder.append(true);
let builder = make_list_builder();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll go through the tests and rework the builders.

@istvan-fodor
Copy link
Contributor Author

@alamb , not sure you saw my comment on line 328. Can you check? The new behavior of non-nullable fields raises an additional question about null fields and I wanted to make sure this is in line with the rest of the arrow builders.

@tustvold
Copy link
Contributor

tustvold commented Apr 3, 2024

I could be missing something but I don't see a comment on line 328? Is it possible that it is pending and the review hasn't been submitted?

@@ -248,9 +325,37 @@ mod tests {
}

#[test]
fn test_fixed_size_list_array_builder_finish_cloned() {
fn test_fixed_size_list_array_builder_with_field() {
let builder = make_list_builder(false, false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb , I had a question about this test in particular. With the new capability to set the field as a non-nullable we have the situation where the builder has a null element built like this:

 builder.values().append_null();
 builder.values().append_null();
 builder.values().append_null();
 builder.append(false);

If the field is set to non-nullable, this element will fail the null check. I wanted to check with you if this is logically right or not? Are we making a distinction between nulls in the values builder, where let's say 1 of the 3 elements of a particular fixed size list is null (which would be a valid fail on non-nullability) or nulls in the main builder (which is built up as 3 nulls + a append(false) in the values builder). To me the second one is questionable and comes down to how you want this to work. If this test should not fail with the second complete null case, then I have to go back and revise the null check of the values builder and somehow control for null elements of the main builder.

Copy link
Contributor

@tustvold tustvold Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic we apply for FixedSizeList and StructArray elsewhere is to permit "masked" nulls, this is because a null must always take up space. I am not sure if we want to replicate this logic here.

See https://docs.rs/arrow-array/latest/arrow_array/array/struct.FixedSizeListArray.html#method.try_new

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the field is set to non-nullable, this element will fail the null check. I wanted to check with you if this is logically right or not?

I think I would expect it to be consistent with whatever FixedSlizeListArray::try_new did

@istvan-fodor
Copy link
Contributor Author

I could be missing something but I don't see a comment on line 328? Is it possible that it is pending and the review hasn't been submitted?

Yes, that was the case. I just changed it to a regular comment

@SteampunkIslande
Copy link

Hello ! I'm very interested in this PR, I really need FixedSizeList columns and didn't find a workaround that makes the FixedSizeListBuilder create a column with an appropriate name. So I'm really looking forward for this to be accepted

@tustvold
Copy link
Contributor

tustvold commented Apr 4, 2024

There is currently an ongoing discussion w.r.t null handling, in the short term you could possibly use the array constructors directly - https://docs.rs/arrow-array/latest/arrow_array/array/struct.FixedSizeListArray.html#method.try_new

@SteampunkIslande
Copy link

There is currently an ongoing discussion w.r.t null handling, in the short term you could possibly use the array constructors directly - https://docs.rs/arrow-array/latest/arrow_array/array/struct.FixedSizeListArray.html#method.try_new

Thanks, but I really need a builder. Tried implementing it myself with no luck so far. I'm pretty surprised that the builder doesn't know its field name though. There should be a way to know right ? At least I'm hoping that with your pull request it will be fixed

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks you @istvan-fodor -- I think this is getting close. Some final touchups:

  1. Update the test for zero length lists (should not fail null check)
  2. Add a test for building a 2 element array where one elemnt is NULL (with a null element)
  3. (maybe) Another test showing the null behavior you are asking about


#[test]
#[should_panic(
expected = "field is nullable = false, but the values_builder contains null values"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem right -- the builder doesn't have null values, instead it has no values at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @alamb. this is not a 0 sized list. I created the make_list_builder to construct builders (you recommended something similar when I first opened the PR). The two boolean parameters control whether to include a null in the builder and to include a particular fixed size list that has a null element.

);
if !f.is_nullable() {
assert!(
values_data.null_count() == 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is valid to have a zero length array (which would also have no nulls, but the null count would be zero)

Suggested change
values_data.null_count() == 0,
values_data.null_count() == 0 || values_data.len() == 0,

(same for the check below)

@@ -248,9 +325,37 @@ mod tests {
}

#[test]
fn test_fixed_size_list_array_builder_finish_cloned() {
fn test_fixed_size_list_array_builder_with_field() {
let builder = make_list_builder(false, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the field is set to non-nullable, this element will fail the null check. I wanted to check with you if this is logically right or not?

I think I would expect it to be consistent with whatever FixedSlizeListArray::try_new did

@istvan-fodor
Copy link
Contributor Author

istvan-fodor commented Apr 8, 2024

@alamb @tustvold , I added some changes.

  1. null check now follows the FixedSizeListArray::try_new logic
  2. Added two new unit tests to test scenario with nulls: test_fixed_size_list_array_builder_with_field_and_null and test_fixed_size_list_array_builder_cloned_with_field_and_null
  3. Test cases for nullable fields and empty builders: test_fixed_size_list_array_builder_with_field_empty and test_fixed_size_list_array_builder_cloned_with_field_empty

@tustvold tustvold merged commit 1b3d1a9 into apache:master Apr 9, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 9, 2024

Thanks you for sticking with this @istvan-fodor 🙏

Also, thank you @tustvold for the follow up in #5612 🚀

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
4 participants