ARROW-10236: [Rust][DataFusion] Unify type casting logic in DataFusion #8400
Conversation
the idea is to use a common can_cast_types function to detect valid casts at plan time, and make can_cast_types consistent with the arrow cast kernel
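A minimal sketch of what such a plan-time check could look like, assuming present-day arrow where `can_cast_types` sits next to the cast kernel; the `plan_cast` helper below is hypothetical, not DataFusion's actual API:

```rust
use arrow::compute::kernels::cast::can_cast_types;
use arrow::datatypes::DataType;

/// Hypothetical plan-time check: reject an unsupported CAST while building the
/// plan instead of failing later inside the executed kernel.
fn plan_cast(input_type: &DataType, target_type: &DataType) -> Result<(), String> {
    if can_cast_types(input_type, target_type) {
        Ok(())
    } else {
        Err(format!(
            "Unsupported CAST from {:?} to {:?}",
            input_type, target_type
        ))
    }
}
```

With something like this in place, the planner can report "unsupported cast" errors using the same rules the kernel enforces at execution time.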
(branch force-pushed from 3f05ca9 to 4ee37c3)
// Note this is meant to mirror the structure in arrow/src/compute/kernels/cast.rs
match (from_type, to_type) {
This is something that I have been thinking about: in DataFusion, because we use dynamically typed arrays, we often have to annotate which types are supported by each arrow kernel / operation.
Thus, we need to duplicate these matches to enumerate the valid types accepted by the kernels, so that we error during planning.
I wonder if these functions shouldn't live closer to the implementation (i.e. in arrow::compute::kernels), or, alternatively, whether we could find an idiom that would allow us to write these matches only once (again, in compute::kernels).
It seems to me that the pattern is:
for compute:
match array.data_type() {
... => compute(array)
}
for datatypes:
match datatype {
... => return Some(return_datatype)
}
one idea would be to use
match datatype {
... => return Some((closure_that_computes, return_datatype))
}
so that both use-cases could be written in a single match (reducing the risk of mistyping, or of changing one place without changing the other).
This comment is not specific to this PR, which I still need to go through: I was just curious about your thoughts on this, since you have been touching a couple of these recently.
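A minimal, self-contained sketch of this single-match idiom using a made-up `negate` operation (the `Kernel` alias, `negate_dispatch`, and the two kernels are hypothetical, not arrow APIs): one match yields both the function that computes the result and the result's DataType, so the planner and the executor consult the same table.

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Float64Array, Int64Array};
use arrow::datatypes::DataType;

/// The kernel applied to the dynamically typed input array.
type Kernel = fn(&ArrayRef) -> ArrayRef;

/// One match answers both questions at once: is this input type supported,
/// and what DataType does the operation return.
fn negate_dispatch(input: &DataType) -> Option<(Kernel, DataType)> {
    match input {
        DataType::Int64 => Some((negate_i64 as Kernel, DataType::Int64)),
        DataType::Float64 => Some((negate_f64 as Kernel, DataType::Float64)),
        // anything else is rejected here, once, for planner and executor alike
        _ => None,
    }
}

fn negate_i64(array: &ArrayRef) -> ArrayRef {
    let a = array.as_any().downcast_ref::<Int64Array>().unwrap();
    let negated: Int64Array = a.iter().map(|v| v.map(|x| -x)).collect();
    Arc::new(negated)
}

fn negate_f64(array: &ArrayRef) -> ArrayRef {
    let a = array.as_any().downcast_ref::<Float64Array>().unwrap();
    let negated: Float64Array = a.iter().map(|v| v.map(|x| -x)).collect();
    Arc::new(negated)
}
```

The planner would call `negate_dispatch(t).map(|(_, dt)| dt)` to obtain the output type, while the executor would invoke the returned kernel; adding a new supported type then means touching exactly one match arm.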
@jorgecarleitao -- I think the idea of using a single match in cast.rs is a good one. I'll try and whip up a proposed PR with that pattern for discussion (though probably not until tomorrow morning US Eastern time, or the weekend)
I was also going to suggest that we move can_cast_types to Arrow.
@jorgecarleitao and @nevi-me -- here is the pattern I came up with:
pub fn cast(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef> {
let from_type = array.data_type();
let func = get_cast_function(from_type, to_type)?;
func(array)
}
/// Returns true if casting `array`'s data type to `to_type` is possible, false otherwise
pub fn can_cast(array: &ArrayRef, to_type: &DataType) -> bool {
let from_type = array.data_type();
get_cast_function(from_type, to_type).is_ok()
}
type CastFunction = Fn(&ArrayRef) -> Result<ArrayRef>;
/// Returns a function that can cast `array` to `to_type`.
///
/// All type checking for supported types should be done in
/// get_cast_func itself. Thus, if Ok(func) is returned, func(array)
/// should be able to succeed for arrays. In other words, Err(_)
/// should be returned by `get_cast_function`, NOT `func(array)` if
/// the types are incompatible.
///
/// can_cast relies on this functionality
fn get_cast_function(from_type: &DataType, to_type: &DataType) -> Result<CastFunction> {
use DataType::*;
// clone array if types are the same
if from_type == to_type {
return Ok(|array| { Ok(array.clone()) })
}
match (from_type, to_type) {
(Struct(_), _) => Err(ArrowError::ComputeError(
"Cannot cast from struct to other types".to_string(),
)),
(_, Struct(_)) => Err(ArrowError::ComputeError(
"Cannot cast to struct from other types".to_string(),
)),
(List(_), List(ref to)) => Ok(|array| {
let data = array.data_ref();
let underlying_array = make_array(data.child_data()[0].clone());
let cast_array = cast(&underlying_array, &to)?;
let array_data = ArrayData::new(
*to.clone(),
array.len(),
Some(cast_array.null_count()),
cast_array
.data()
.null_bitmap()
.clone()
.map(|bitmap| bitmap.bits),
array.offset(),
// reuse offset buffer
data.buffers().to_vec(),
vec![cast_array.data()],
);
let list = ListArray::from(Arc::new(array_data));
Ok(Arc::new(list) as ArrayRef)
}),
(List(_), _) => Err(ArrowError::ComputeError(
"Cannot cast list to non-list data types".to_string(),
)),
(_, List(ref to)) => {
// cast primitive to list's primitive
let cast_func = get_cast_function(from_type, &to)?;
Ok(move |array| {
let cast_array = cast_func(array)?;
// create offsets, where if array.len() = 2, we have [0,1,2]
let offsets: Vec<i32> = (0..=array.len() as i32).collect();
let value_offsets = Buffer::from(offsets[..].to_byte_slice());
let list_data = ArrayData::new(
*to.clone(),
array.len(),
Some(cast_array.null_count()),
cast_array
.data()
.null_bitmap()
.clone()
.map(|bitmap| bitmap.bits),
0,
vec![value_offsets],
vec![cast_array.data()],
);
let list_array = Arc::new(ListArray::from(Arc::new(list_data))) as ArrayRef;
Ok(list_array)
})
}
...

While I think it would work, I am not sure I will have enough time.
Instead, I will get a PR up that moves the can_cast_types (with duplicate match) to arrow and then try a separate refactoring PR to unify the two match arms
fn is_numeric_type(t: &DataType) -> bool {
This is also something that could perhaps live in arrow::util or arrow::datatypes (and be implemented as a trait method on DataType).
I needed to check if a datatype was listy or stringy, and the boilerplate was tedious.
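A minimal sketch of that idea, assuming a hypothetical extension trait (the trait name and the exact variant list are assumptions, not an existing arrow API):

```rust
use arrow::datatypes::DataType;

/// Hypothetical extension trait so callers can ask a DataType about itself
/// instead of re-declaring the same match in every crate.
trait DataTypeExt {
    fn is_numeric(&self) -> bool;
}

impl DataTypeExt for DataType {
    fn is_numeric(&self) -> bool {
        use DataType::*;
        matches!(
            self,
            Int8 | Int16 | Int32 | Int64
                | UInt8 | UInt16 | UInt32 | UInt64
                | Float16 | Float32 | Float64
        )
    }
}
```

Similar helpers (`is_list`, `is_string`, ...) could then hang off the same trait, which is the "listy or stringy" boilerplate mentioned above.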
jorgecarleitao left a comment
I am not sure if this is ready, as the section on tests is still missing, but so far this looks really good to me!
I have no specific comments apart from the idea of bringing that code to arrow itself. I would also be fine with doing it in another PR.
Thanks for the nudge @jorgecarleitao -- my plan is to move the code into arrow itself (and I was going to try some ideas to avoid the duplicate match).

To keep the PR list clean, I will close this one and open a new PR with the alternate proposed implementation.

#8460 is the follow-on.
…ataFusion

This is a PR incorporating the feedback from @nevi-me and @jorgecarleitao from #8400. It adds:

1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao / @nevi-me in #8400 (comment)) that encodes the valid type casting
2. a test that ensures `can_cast_types` and `cast` remain in sync
3. bug fixes that the test above uncovered (I'll comment inline)
4. a change to DataFusion to use `can_cast_types` so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion.

I am adding this functionality so DataFusion gains rudimentary support for `DictionaryArray`.

Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types`. I have some thoughts on how to unify them (see #8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.

Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2

Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
This is a proposed approach; if people like it, I will write proper tests (to ensure that the cast kernel support and this function remain in sync).
This PR brings DataFusion to parity with the type casting supported by arrow and allows DataFusion to plan all casts that are supported by the arrow cast kernel.
Previously the notions of coercion and casting were somewhat conflated. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion.
I personally want this functionality so that when I add support for `DictionaryArray` casts in Arrow (#8346) they can also be used in DataFusion.

Codewise, I am concerned about the duplication in logic between this and cast.rs. However, the test I have in mind will ensure they don't get out of sync.
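A rough sketch of the kind of sync test described here, using the `can_cast_types` that eventually landed next to arrow's cast kernel; the array and type lists below are illustrative, and a real test would enumerate every DataType the kernel knows about:

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::compute::kernels::cast::{can_cast_types, cast};
use arrow::datatypes::DataType;

#[test]
fn cast_and_can_cast_types_agree() {
    // One representative array per source type.
    let arrays: Vec<ArrayRef> = vec![
        Arc::new(Int32Array::from(vec![Some(1), None])) as ArrayRef,
        Arc::new(StringArray::from(vec![Some("1.5"), None])) as ArrayRef,
    ];
    let targets = [DataType::Int32, DataType::Utf8, DataType::Float64];

    for array in &arrays {
        for to_type in &targets {
            // Whatever the planner says must match what the kernel actually does.
            let planned = can_cast_types(array.data_type(), to_type);
            let executed = cast(array, to_type).is_ok();
            assert_eq!(
                planned, executed,
                "can_cast_types and cast disagree for {:?} -> {:?}",
                array.data_type(),
                to_type
            );
        }
    }
}
```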
Questions for reviewers:
FYI @jorgecarleitao