ARROW-12411: [Rust] Add Builder interface for adding Arrays to RecordBatches #10063
Conversation
Err(shared_schema) => shared_schema.as_ref().clone(),
};

let nullable = field_values.null_count() > 0;
There's a limitation here. If the purpose is to create a single record batch, and that batch is used alone through its lifetime, then this is fine; otherwise we might need to take a nullable parameter.
In any case, I think that if someone uses this to create individual record batches, it'd be inefficient.
I agree this would be an important point to clarify in the comments. If you are creating more than one RecordBatch, you should use the existing API to create the RecordBatch from a Schema and Vec<ArrayRef>.
If you are creating a single one, then this is more convenient.
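For example (untested sketch; the schema is built once and then reused for every batch):
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// declare the schema a single time
let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
// every batch shares the same schema, so each one is validated against it
let batch1 = RecordBatch::try_new(schema.clone(), vec![a.clone()])?;
let batch2 = RecordBatch::try_new(schema, vec![a])?;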
This isn't an area that I'm super-familiar with, but are there any existing methods or expectations around merging schemas that differ solely on nullability, e.g. merging two schemas together and inferring the nullability of each field by ORing the nullability on each side of the merge? So if either schema declares the field as nullable, we take the resultant schema as nullable.
If not, then I'd agree that we probably should keep this explicitly declared and not infer it here.
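To sketch the OR-ing idea (merge_nullability is a hypothetical helper, not an existing arrow API):
use arrow::datatypes::Field;

// the merged field is nullable if either input schema declares it nullable
fn merge_nullability(a: &Field, b: &Field) -> bool {
    a.is_nullable() || b.is_nullable()
}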
nevi-me left a comment
Some comments :) I like the idea, at the least it'd make constructing batches from arrays in tests more convenient.
/// Creates a new [`RecordBatch`] with no columns
///
/// TODO add a code example using `append`
pub fn new() -> Self {
I think it's worthwhile to maintain the property that a RecordBatch is immutable, so this feels a bit out of place to me.
How about we create a builder that takes an iterator of (&str, ArrayRef) instead?
The below is what I mean by out of place:
let mut batch = read_batch_from_somewhere()?;
batch = batch.append("field", array_ref)?;
There might be a benefit in having the above, so I'm open to convincing :)
This could partially address your question on the TODO about returning Self on error.
This makes sense @nevi-me -- I think a separate RecordBatchBuilder sounds like a better idea to me. I'll wait a bit for feedback from others, and then give it a shot.
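Roughly the shape I have in mind (very rough, untested sketch; RecordBatchBuilder and its method names are hypothetical):
use std::sync::Arc;
use arrow::array::ArrayRef;
use arrow::datatypes::{Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

struct RecordBatchBuilder {
    columns: Vec<(String, ArrayRef)>,
}

impl RecordBatchBuilder {
    fn new() -> Self {
        Self { columns: vec![] }
    }

    // collect (name, array) pairs; nothing is validated until build()
    fn append(mut self, name: &str, array: ArrayRef) -> Self {
        self.columns.push((name.to_string(), array));
        self
    }

    // infer each Field from its array, then defer to the existing constructor
    fn build(self) -> Result<RecordBatch> {
        let fields: Vec<Field> = self
            .columns
            .iter()
            .map(|(name, array)| Field::new(name, array.data_type().clone(), array.null_count() > 0))
            .collect();
        let arrays = self.columns.into_iter().map(|(_, array)| array).collect();
        RecordBatch::try_new(Arc::new(Schema::new(fields)), arrays)
    }
}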
/// TODO: code example
///
/// TODO: on error, can we return `Self` in some meaningful way?
pub fn append(self, field_name: &str, field_values: ArrayRef) -> Result<Self> {
Given that ArrayRef is backed by an Arc, how about taking it by reference then cloning it?
Am I correct that if this errors, the underlying ArrayRef could get dropped if it's not a clone in itself?
Something like:
let array = Int32Array::from(vec![1, 2, 3]);
let array_ref = Arc::new(array);
// append to existing batch
let mut batch = ...;
// assume that this fails; does array_ref get dropped, as we no longer have references to it?
batch = batch.append("ints", array_ref)?;
I guess I was thinking we would leave the "do I want to have access to the ArrayRef after the call to append" decision up to the caller.
So, something like:
// append to existing batch
let mut batch = ...;
// We want to use array_ref later, so pass in a clone
batch = batch.append("ints", Arc::clone(array_ref))?;
I think this is a common pattern in Rust libraries: if a function needs to clone its argument, it takes it by value instead, which gives the caller a chance to avoid an extra copy if they don't need the argument again.
But in this case, with an Arc that is cheap to clone, I don't think the difference really matters from a performance perspective.
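A toy illustration of that convention (store is a made-up function):
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array};

// takes the Arc by value, so the caller decides whether to clone it
fn store(array: ArrayRef) -> Vec<ArrayRef> {
    vec![array]
}

let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
let kept = store(Arc::clone(&a)); // we still need `a` afterwards, so clone
let done = store(a);              // finished with `a`: hand it over, no extra clone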
Thanks for the comments @nevi-me!
Wouldn't the following work? (untested)
let batch: RecordBatch = (&StructArray::try_from(vec![
    ("a", float_array),
    ("b", timestamp_array),
])?).into();
If we do not want to use a …
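Spelled out with concrete arrays, the suggestion would look something like this (untested; an Int32 column stands in for the timestamp, and it assumes arrow's TryFrom<Vec<(&str, ArrayRef)>> impl for StructArray plus From<&StructArray> for RecordBatch):
use std::convert::TryFrom;
use std::sync::Arc;
use arrow::array::{ArrayRef, Float64Array, Int32Array, StructArray};
use arrow::record_batch::RecordBatch;

let float_array: ArrayRef = Arc::new(Float64Array::from(vec![1.0, 2.0]));
let int_array: ArrayRef = Arc::new(Int32Array::from(vec![10, 20]));
// the StructArray carries both names and types, so no separate Schema is needed
let batch: RecordBatch = (&StructArray::try_from(vec![
    ("a", float_array),
    ("b", int_array),
])?).into();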
@jorgecarleitao that is a really good idea. I learn something new every day! I think using a …
returnString left a comment
I'm a big fan of this idea, especially for making testing cleaner and creating simple MemTable instances within DataFusion. I've done some pretty horrendous things to simplify creation of small batches.
Agreed with the existing comments that it might be more robust to do this via From, or maybe TryFrom to support erroring out, but I do like the simplicity of taking self by value for a quick chained setup too.
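e.g. the rough shape of the TryFrom route (hypothetical; a real impl would have to live inside the arrow crate itself):
use std::convert::TryFrom;
use arrow::array::ArrayRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

impl TryFrom<Vec<(&str, ArrayRef)>> for RecordBatch {
    type Error = ArrowError;

    fn try_from(columns: Vec<(&str, ArrayRef)>) -> Result<Self, Self::Error> {
        // would infer a Field per (name, array) pair, then call RecordBatch::try_new
        todo!()
    }
}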
}
}

impl Default for RecordBatch {
Not totally sure about this, because it bypasses the invariants of validate_new_batch; it seems like we don't want to allow 0-column record batches to exist. Might be an argument in favour of the impl From approach?
The only place I know about merging schemas is arrow/rust/arrow/src/datatypes/field.rs line 485 (at 2f3ed3a), which by my reading effectively does do an OR of nullability.
Given my use case of "make a single record batch from a known set of arrays", I really would prefer to avoid having to include redundant information (nullability). However, given we are going to go with a From-style conversion, we could use something like:
let batch: RecordBatch = vec![
    ("a", float_array, false),
    ("b", timestamp_array, true),
].into();
Or similar, to explicitly specify the nullability.
The Apache Arrow Rust community is moving the Rust implementation into its own dedicated GitHub repositories, arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository. Please see the mailing-list thread for more details. We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.
Thanks everyone for the feedback -- I have created a PR apache/arrow-rs#7 with an implementation from iterators.
Purpose
This PR is a draft for comment / review. If people generally like this idea, I will polish up this PR with doc examples / comments and more tests for a real review.
Rationale / Use case:
While writing tests (both in IOx and in DataFusion) where I need a single RecordBatch, I often find myself doing something like the sketch below. This is annoying because I have to redundantly (and verbosely) encode the information that float_field is a Float64 both in the Schema and the Float64Array.
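For illustration, a minimal version of the pattern (hypothetical field names and values, not the exact IOx snippet):
use std::sync::Arc;
use arrow::array::{ArrayRef, Float64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// the Float64 type is stated once in the Schema...
let schema = Arc::new(Schema::new(vec![Field::new("float_field", DataType::Float64, false)]));
// ...and again, implicitly, in the concrete array type
let float_array: ArrayRef = Arc::new(Float64Array::from(vec![1.0, 2.0, 3.0]));
let batch = RecordBatch::try_new(schema, vec![float_array])?;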
I would much rather be able to construct RecordBatches using a more Rust-like style that avoids the redundancy and reduces the amount of typing:
Proposal:
Add RecordBatch::append so the following syntax can be supported:
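Something like this (sketch based on the signatures in the diff above; float_array and timestamp_array stand in for any ArrayRef values):
let batch = RecordBatch::new()
    .append("float_field", float_array)?
    .append("time", timestamp_array)?;
Existing APIs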
The existing APIs to create a RecordBatch from a Schema and Vec<ArrayRef> would not be changed, as there are plenty of use cases where the Schema is known up front and should not be checked each time.