Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Fields abstraction (#3955) #3965

Merged
merged 6 commits into from
Mar 30, 2023
Merged

Add Fields abstraction (#3955) #3965

merged 6 commits into from
Mar 30, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Mar 27, 2023

Which issue does this PR close?

Part of #3955

Rationale for this change

This adds a cheaply cloneable Fields abstraction, that internally contains FieldRef within a reference counted slice.

This achieves a couple of things:

  • The use of FieldRef allows projecting / reconstructing schema without needing to copy Field
  • The use of Arc<[FieldRef]> allows cheap cloning of DataType, construction of DataType::Struct, etc...
  • Allows cheap pointer comparisons between Schema and DataType
  • Provides a potential extension point for faster field name lookups

What changes are included in this PR?

Are there any user-facing changes?

Yes, this makes a fundamental change to the schema representation

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 27, 2023
/// ```
///
#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
pub struct Fields(Arc<[FieldRef]>);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally defined Fields = Vec<FieldPtr>

Whilst simple the lack of a newtype made for a more convoluted migration, with a newtype we can define conversions From<Vec<Field>>, etc... to help reduce friction

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a better formulation than a typedef and will allow for more flexibility

@@ -182,7 +183,7 @@ pub enum DataType {
/// A single LargeList array can store up to [`i64::MAX`] elements in total
LargeList(Box<Field>),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick follow PR would then replace Box<Field> with FieldRef

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another followup could be done for Union, although that would also benefit from a Vec<(Field, i8)> instead of two separate vectors. I think that also currently makes it the largest variant, which increases the needed size of all datatypes.

A slightly hacky improvement for union could also be to move the type_id into Field and leave it unused in most places. That should basically be free since Field already has a few bits of padding left.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I plan to do the other variants in a follow up. I think changing it to (Fields, Arc<[i8]>, UnionMode) may be sufficient and would keep things simple


impl From<Vec<FieldRef>> for Fields {
fn from(value: Vec<FieldRef>) -> Self {
Self(value.into())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is perhaps worth highlighting that this is implemented as a memove, it cannot reuse the allocation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it cannot reuse the allocation

If the vector is allocation is oversized. I think it will reuse the allocation if the vector is at capacity (which is rare though).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly the implementation will always move regardless, I think it is some limitation of unsized coercion

@tustvold
Copy link
Contributor Author

FYI @alamb @crepererum @viirya I would appreciate your thoughts on this

Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it 👍

arrow-schema/src/fields.rs Show resolved Hide resolved

impl From<Vec<FieldRef>> for Fields {
fn from(value: Vec<FieldRef>) -> Self {
Self(value.into())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it cannot reuse the allocation

If the vector is allocation is oversized. I think it will reuse the allocation if the vector is at capacity (which is rare though).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like it

/// ```
///
#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
pub struct Fields(Arc<[FieldRef]>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a better formulation than a typedef and will allow for more flexibility

pub fn size(&self) -> usize {
self.iter().map(|field| field.size()).sum()
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While constructing / Modifying lists of fields, I think it would be great if we could also add functions like

/// Maybe something more generic to allow adding a Field and FieldREf
pub fn push(mut &self, field: Field...) {
...
}

Copy link
Contributor Author

@tustvold tustvold Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we can make SchemaBuilder public and add such a method to it, Fields itself is inherently immutable


use crate::error::ArrowError;
use crate::field::Field;
use crate::{FieldRef, Fields};

/// A builder to facilitate building a [`Schema`] from iteratively from [`FieldRef`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I think is one of the most expensive operations in DataFusion planning now: apache/datafusion#5157 (comment)

So 👍

@github-actions github-actions bot added arrow-flight Changes to the arrow-flight crate parquet Changes to the parquet crate labels Mar 28, 2023
},
);

let iter = v.into_iter();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope to rework these once StructArray::new exists as part of #3880

@@ -302,6 +306,21 @@ impl From<(Vec<(Field, ArrayRef)>, Buffer)> for StructArray {
}
}

impl From<RecordBatch> for StructArray {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replaces the existing implementation in record_batch.rs with a more optimal implementation

@@ -467,19 +466,6 @@ impl From<&StructArray> for RecordBatch {
}
}

impl From<RecordBatch> for StructArray {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to struct_array.rs

self.iter().map(|field| field.size()).sum()
}

/// Searches for a field by name, returning it along with its index if found
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be an obvious place to add hash based lookup or similar

@tustvold tustvold marked this pull request as ready for review March 28, 2023 18:37
Comment on lines 787 to +788
let struct_type =
DataType::Struct(vec![Field::new("data", DataType::Int64, false)]);
DataType::Struct(vec![Field::new("data", DataType::Int64, false)].into());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

into works like Fields::from here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I switched between the two to make rustfmt happy 😅


/// A cheaply cloneable, owned slice of [`FieldRef`]
///
/// Similar to `Arc<Vec<FieldPtr>>` or `Arc<[FieldPtr]>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FieldPtr? Do you mean FieldRef?

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This abstraction looks good and datatype/schema manipulation can be more efficient.

@tustvold
Copy link
Contributor Author

Notified the mailing list about this - https://lists.apache.org/thread/pmxq5j864qlkp36lvxg8kvk0kct56r8m

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold -- I think this looks in general very good.

My biggest concern is on the amount of API churn that this will generate -- I think there may be a way to reduce the churn and make this PR smaller, and I left comments to that effect.

Once we sort it out and get this merged, I think we should then try (almost immediately) to upgrade some other project that makes significant use of arrow-rs to see how painful the upgrade is (and if there are other ergonomic things that could be done to ease the transition pain)

Thank you again for pushing this through

arrow-array/src/builder/struct_builder.rs Show resolved Hide resolved
@@ -178,10 +181,10 @@ mod tests {
),
Field::new(
"c25",
DataType::Struct(vec![
DataType::Struct(Fields::from(vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For anyone who uses DataType::Struct this is now getting complicated to construct

I wonder if we can ease the pain by having something

impl DataType {
  fn new_struct(fields: impl Into<Fields>) -> Self {
 ..
}

So then this could be

Suggested change
DataType::Struct(Fields::from(vec![
DataType::new_struct(vec![

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced by this, the major reason for using the more verbose DataType::Struct(Fields::from(..)) was to reduce formatting churn, most downstreams will just be able to use .into().

I'll have a go upgrading DataFusion to assess the churn required

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in my mind it is about the cognative load. Now I need to know what a Fields is, import it, construct one, etc.

Maybe the magic into() will make it ok

@tustvold
Copy link
Contributor Author

DataFusion upgrade PR - apache/datafusion#5782

@tustvold
Copy link
Contributor Author

There don't appear to be any objections to this, and there is plenty of time until the next release, and so I am going to get this in before it develops merge conflicts. We can continue to iterate from there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants