-
Notifications
You must be signed in to change notification settings - Fork 737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the ability for Maps to cast to another case where the field names are different #5703
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @HawaiianSpork -- this looks pretty close to me
Some high level comments:
- I think we should try to avoid adding methods to
DataType
as they proposed ones are somewhat specialized andDataType
is a pub type anyways (so its implementation is not encapsulated) - The PR title says "where field names are different" but this PR not only supports different field names but also casts the key and value types as well, I think
- There are a few more tests I think are needed (error cases and Nulls)
All in all I think this PR is quite nice. Thanks again
arrow-cast/src/cast/mod.rs
Outdated
&entry_offsets, | ||
).unwrap(); | ||
|
||
let new_type = DataType::Map(Arc::new(Field::new("entries_new", DataType::Struct( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add an assert here on what the type of array
was (to show it wasn't the same as new-type
)?
arrow-cast/src/cast/mod.rs
Outdated
@@ -7100,6 +7116,54 @@ mod tests { | |||
FixedSizeListArray::from(list_data) | |||
} | |||
|
|||
#[test] | |||
fn test_cast_map_containers() { | |||
let keys = vec!["0", "1", "5", "6", "7"]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add test coverage for None
(aka when the keys / values also have null values)?
I think it might also make this test clearer if you used MapBuilder
to construct the inputs: https://docs.rs/arrow/latest/arrow/array/builder/struct.MapBuilder.html
arrow-schema/src/datatype.rs
Outdated
@@ -670,6 +670,42 @@ impl DataType { | |||
pub fn new_fixed_size_list(data_type: DataType, size: i32, nullable: bool) -> Self { | |||
DataType::FixedSizeList(Arc::new(Field::new_list_field(data_type, nullable)), size) | |||
} | |||
|
|||
/// Gets the key field in a map data type. For all other types returns None. | |||
pub fn key_field(&self) -> Option<FieldRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are already matching on DataType::Map
above you'll already have entries_field
available -- so we could avoid adding this new API to Datatype
.collect::<Vec<_>>(); | ||
assert_eq!(&values_string, &vec!["test_val_1", "test_val_2", "long_test_val_3", "4", "test_val_5"]); | ||
} | ||
|
||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also please add a negative tests for:
- not being able to cast two
MapArray
s with different values forordered
? - Two MapArrays where the values can not be cast (maybe where the values are
Interval
s and trying to cast to another map array where the values areDuration
s)
arrow-cast/src/cast/mod.rs
Outdated
@@ -710,6 +718,14 @@ pub fn cast_with_options( | |||
(_, FixedSizeList(ref to, size)) if *size == 1 => { | |||
cast_values_to_fixed_size_list(array, to, *size, cast_options) | |||
} | |||
(Map(_, ordered1), Map(_, ordered2)) if ordered1 == ordered2 => | |||
cast_map_values( | |||
array.as_map_opt() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could simply use as_map()
given this is in the match
clause when data type was a map. As in there is no reason to return this error as it is not expected and panic'ing would be consistent with the rest of the codebase if the data type didn't match the array
arrow-cast/src/cast/mod.rs
Outdated
@@ -157,6 +159,12 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool { | |||
(_, LargeList(list_to)) => can_cast_types(from_type, list_to.data_type()), | |||
(_, FixedSizeList(list_to,size)) if *size == 1 => { | |||
can_cast_types(from_type, list_to.data_type())}, | |||
(Map(_,ordered_from), Map(_, ordered_to)) if ordered_from == ordered_to => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment below -- I think we could avoid key_type/value_type here
arrow-cast/src/cast/map.rs
Outdated
StructArray::new( | ||
Fields::from(vec![key_field, value_field]), | ||
vec![key_array, value_array], | ||
from.nulls().cloned(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be from.entries().nulls().cloned()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct, I will fix.
…s are different. Arrow Maps have field names for the elements of the fields, the field names are allowed to be any value and do not affect the type of the data. This allows a Map where the field names are key_value, key, value to be mapped to a entries, keys, values. This can be helpful in merging record batches that may have come from different sources. This also makes maps behave similar to lists which also have a field to distinguish their elements.
Co-authored-by: Andrew Lamb <[email protected]>
- simplify map casting logic to reuse the entries - Added unit tests for negative cases - Use MapBuilder to make the intended type clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @HawaiianSpork and @linhr for the review. This PR looks good to me
Arrow Maps have field names for the elements of the fields, the field names are allowed to be any value and do not affect the type of the data.
This allows a Map where the field names are key_value, key, value to be mapped to a entries, keys, values.
This can be helpful in merging record batches that may have come from different sources. This also makes maps behave similar to lists which also have a field to distinguish their elements.
Which issue does this PR close?
Closes #5702.
Rationale for this change
Arrow has field names for list elements and map elements. arrow-rs can cast lists with one name for elements (like elements) to another name like `items). But this is not supported for the map type which has field names, one for the elements, one for the keys and one for the values. This can be a limitation for anyone using arrays of Maps from different sources.
What changes are included in this PR?
The cast functions now support casting from one Map type to another Map type where nothing is different other than the root field elements name, the key field name and value field name.
Are there any user-facing changes?
No, I would not consider this a user breaking change unless some client was expecting an error when map root level field names did not match.