Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BooleanArray::new_from_packed and BooleanArray::new_from_u8 #6127

Merged

Conversation

chloro-pn
Copy link
Contributor

@chloro-pn chloro-pn commented Jul 26, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

Users can easily build BooleanArray from &[u8]

What changes are included in this PR?

  • add BooleanArray::new_from_packed
  • add BooleanArray::new_from_u8

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 26, 2024
@tustvold
Copy link
Contributor

I'm not sure about this, as it doesn't provide a mechanism to indicate a length that isn't a multiple of 8 or aligned.

Perhaps you could expand on what the use-case for this is, and why you would be using this as opposed to say BooleanBufferBuilder or similar?

@chloro-pn
Copy link
Contributor Author

I'm not sure about this, as it doesn't provide a mechanism to indicate a length that isn't a multiple of 8 or aligned.

Perhaps you could expand on what the use-case for this is, and why you would be using this as opposed to say BooleanBufferBuilder or similar?

Here, we only convert each bit in &[u8] to boolean and use it to build BooleanArray. It does not involve directly setting certain internal components, so there is no need to consider alignment and whether the length is a multiple of 8.
For example, in some scenarios, we need to use a bitstream (represented as &[u8] in Rust) to construct objects of type BooleanArray, where each bit in the bitstream is 1 for true and 0 for false.
If we didn't have this PR, we would always need to:

  • first build Buffer from &[u8],
  • then build BooleanBuffer from Buffer, and
  • finally build BooleanArray from BooleanBuffer.

Alternatively, we need to convert the bitstream into an iterator object that returns a boolean, but these come at a performance loss.

@chloro-pn
Copy link
Contributor Author

Of course, using this method involves two self-evident facts:

  • There is no null in the constructed BooleanArray
  • The length of the constructed BooleanArray is always a multiple of 8;

If these prerequisites are not met, then this method should not be used.
But these will not cause confusion or difficulty for users to understand, as the mapping from bit stream to boolean array is intuitive.

@alamb
Copy link
Contributor

alamb commented Jul 27, 2024

I wonder if this related to the 8 bit boolean extension type discussion on the dev list: https://lists.apache.org/thread/nz44qllq53h6kjl3rhy0531n2n2tpfr0 ?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this proposal @chloro-pn

arrow-array/src/array/boolean_array.rs Outdated Show resolved Hide resolved
@chloro-pn
Copy link
Contributor Author

I wonder if this related to the 8 bit boolean extension type discussion on the dev list: https://lists.apache.org/thread/nz44qllq53h6kjl3rhy0531n2n2tpfr0 ?

No, we are not building a new type here.
We are just providing a more convenient way to build BooleanArray from bitstream, which will have better performance and be more convenient under the premise of meeting usage requirements.

@chloro-pn chloro-pn force-pushed the support_construct_BooleanArray_from_slice_u8 branch from bcf4fba to 0420332 Compare July 27, 2024 12:42
@chloro-pn chloro-pn requested a review from alamb July 27, 2024 16:57
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this more last night.

I think my main concern is if the semantics of BooleanArray::from(&[u8]) will be "surprising" to people

Thus I would like to propose making this a function like BooleanArray::from_u8 like

let arr = BooleanArray::from_u8(..)

We can put the documentation on that method. I think that would avoid any concern about confusion (as users would could the method explicitly).

What do you think?

@@ -50,6 +50,22 @@ use std::sync::Arc;
/// assert_eq!(&values, &[Some(true), None, Some(false), None, Some(false)])
/// ```
///
/// # Example: From `&[u8]`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

@chloro-pn
Copy link
Contributor Author

I thought about this more last night.

I think my main concern is if the semantics of BooleanArray::from(&[u8]) will be "surprising" to people

Thus I would like to propose making this a function like BooleanArray::from_u8 like

let arr = BooleanArray::from_u8(..)

We can put the documentation on that method. I think that would avoid any concern about confusion (as users would could the method explicitly).

What do you think?

I think your suggestion is more appropriate, and I will modify the code later.
The use of &[u8] to represent a bitstream is not so obvious and consistent (&[u8] can only represent a partial bitstream).
The From trait is not only used to represent conversions between types, but also declares a semantically equivalent relationship.I get it.
I am a beginner in Rust, thanks for your code review: )

@tustvold
Copy link
Contributor

tustvold commented Jul 28, 2024

Might I suggest something along the lines of

fn new_from_packed(buffer: impl Into<Buffer>, offset: usize, len:usize)

This would be explicit and consistent with other methods.

Edit: although I wonder if this is then really all that different from

let a: BooleanArray = BooleanBuffer(b.into(), 0, len).into();

I wonder if we really need a new method...

@chloro-pn
Copy link
Contributor Author

chloro-pn commented Jul 28, 2024

Might I suggest something along the lines of

fn new_from_packed(buffer: impl Into<Buffer>, offset: usize, len:usize)

This would be explicit and consistent with other methods.

Edit: although I wonder if this is then really all that different from

let a: BooleanArray = BooleanBuffer(b.into(), 0, len).into();

I wonder if we really need a new method...

This is what I mentioned in my previous reply "providing a more convenient way to build BooleanArray from bitstream".
As a developer who is very familiar with arrow-rs, this method is unnecessary, but for users, I think the provided interface should try to hide the internal structure as much as possible and maintain simplicity.

new_from_packed is a good idea, perhaps we should implement this more universal version.

@tustvold
Copy link
Contributor

tustvold commented Jul 28, 2024

My feeling is this convenience comes at the expense of being rather confusing to users, especially those less familiar with the internal layout of arrow arrays. My 2 cents is if people want a convenient and easy interface they can use the builders or iterators, if they want lower level integration they use lower level interfaces and construct the array from parts, providing a method that is convenient but correct usage requires knowledge of the internal layout seems undesirable.

If Andrew is happy to move forward with this I have no objection to this, but I'm not sold

@alamb
Copy link
Contributor

alamb commented Jul 29, 2024

I agree that it takes a lot of knowledge until this is obvious:

let a: BooleanArray = BooleanBuffer(b.into(), 0, len).into();

It seems to me adding an example of how to do this in the documentation is valuable in any case.

I also think it is valuable to add APIs that make arrow-rs easier to use for less expert users, even if that means they aren't the most concise, and thus I think an API such as proposd in the PR is a good addition

@chloro-pn chloro-pn requested a review from alamb July 29, 2024 13:47
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @chloro-pn -- I think this is a nice addition to the API that is well commented and explains what it does well.


/// Create a new [`BooleanArray`] from `&[u8]`
/// This method uses `new_from_packed` and constructs a [`Buffer`] using `value`, and offset is set to 0 and len is set to `value.len() * 8`
/// using this method will make the following points self-evident:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think many people have said it before, but I still want to say that you are a very nice person. Thank you : )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is all part of my master plan to get everyone working together well to build this ecosystem together -- it is great fun.

@alamb
Copy link
Contributor

alamb commented Jul 29, 2024

Thank you for the review and comments @tustvold

@alamb alamb changed the title Support construct BooleanArray from &[u8] Add BooleanArray::new_from_packed and BooleanArray::new_from_u8 Jul 30, 2024
@alamb alamb merged commit 6e893b5 into apache:master Jul 30, 2024
27 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 30, 2024

Thanks again @chloro-pn and @tustvold

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants