ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion #8460
// temporal casts
(Int32, Date32(_)) => cast_array_data::<Date32Type>(array, to_type.clone()),
(Int32, Time32(_)) => cast_array_data::<Date32Type>(array, to_type.clone()),
(Int32, Time32(TimeUnit::Second)) => {
It is not possible to cast Int32 to a Time32(Microsecond) or Time32(Nanosecond)
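A minimal sketch of the rule this comment describes, assuming (as the Arrow format specifies) that `Time32` only supports second and millisecond resolution. The `TimeUnit` enum and `can_cast_int32_to_time32` below are illustrative stand-ins, not the Arrow API:

```rust
/// Toy stand-in for a time unit enum (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Returns true if an Int32 value could be reinterpreted as Time32 with `unit`.
/// Assumption: Time32 is only defined for second and millisecond resolution;
/// finer resolutions would need a 64-bit representation.
pub fn can_cast_int32_to_time32(unit: TimeUnit) -> bool {
    matches!(unit, TimeUnit::Second | TimeUnit::Millisecond)
}
```

Matching on the concrete units, rather than on `Time32(_)`, is what keeps the invalid combinations out of the cast table.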
let to_size = MILLISECONDS;
if from_size != to_size {
// Scale time_array by (to_size / from_size) using a
Without this code, casting from a timestamp 32 -> Date64 would result in a divide by zero error (as the integer division from_size / to_size was 1 / 1000 == 0).
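The failure mode can be reproduced with plain integers. The sketch below is not the kernel code; `naive_divisor` and `rescale` are hypothetical helpers, and it assumes each size is a count of ticks per second (1, 1_000, 1_000_000, ...) so that the larger size is always a whole multiple of the smaller one:

```rust
/// The divisor the old code would have computed. With integer division,
/// a coarse source (1 tick/s) and a finer target (1_000 ticks/s) yields 0.
pub fn naive_divisor(from_size: i64, to_size: i64) -> i64 {
    from_size / to_size
}

/// Rescales `value` between two resolutions without ever dividing by zero:
/// divide when moving to a coarser unit, multiply when moving to a finer one.
pub fn rescale(value: i64, from_size: i64, to_size: i64) -> i64 {
    if from_size == to_size {
        value
    } else if from_size > to_size {
        // Source is finer than target: the ratio is >= 1, so division is safe.
        value / (from_size / to_size)
    } else {
        // Source is coarser than target: scale up by (to_size / from_size).
        value * (to_size / from_size)
    }
}
```

For example, `rescale(5, 1, 1_000)` takes the multiply branch, whereas dividing by `naive_divisor(1, 1_000)` would panic.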
}
/// Returns true if this type is numeric: (UInt*, Unit*, or Float*)
pub fn is_numeric(t: &DataType) -> bool {
As suggested by @nevi-me on #8400 (comment)
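For readers without the diff open, a helper of this shape might look like the sketch below. The `DataType` enum here is a cut-down, hypothetical stand-in for Arrow's, and the sketch assumes the doc comment's `Unit*` was meant to read `Int*` (signed integers):

```rust
/// Cut-down stand-in for a data type enum (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Boolean,
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Float32,
    Float64,
    Utf8,
}

/// Returns true if `t` is a signed integer, unsigned integer, or float type.
pub fn is_numeric(t: &DataType) -> bool {
    use DataType::*;
    matches!(
        t,
        Int8 | Int16 | Int32 | Int64 | UInt8 | UInt16 | UInt32 | UInt64 | Float32 | Float64
    )
}
```

A predicate like this lets a cast table handle "any numeric to any numeric" in one arm instead of enumerating every pair.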
let schema = Schema::new(vec![Field::new("a", DataType::Utf8, false)]);
let result = cast(col("a"), &schema, DataType::Int32);
result.expect_err("Invalid CAST from Utf8 to Int32");
// Ensure a useful error happens at plan time if invalid casts are used
It turns out that Arrow can, in fact, cast from Utf8 -> Int32.
yup, unifying the cast logic was good for this. I've wanted to add cast options, such as disallowing lossy casts.
If/when we get to that point, we'll have to think about what behaviour we want DataFusion to use.
> If/when we get to that point, we'll have to think about what behaviour we want DataFusion to use.
I think DataFusion now makes the distinction between "casting" (i.e. the user specifically requests a cast from one type to another), which can be lossy, and "coercion" (i.e. casts need to be added explicitly so that expressions such as plus can be evaluated).
Coercion is designed not to be lossy, but casting can be.
The downside is that DataFusion has to have another set of rules for which type coercions are allowed (e.g. I need to add #8463 to properly support DictionaryArray in DataFusion).
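The distinction can be illustrated with a toy model. Nothing below is DataFusion's actual rule set; `NumType`, `cast_f64_to_i32`, and `coerce` are hypothetical names, and the coercion table is deliberately tiny:

```rust
/// Toy set of numeric types (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum NumType {
    Int32,
    Int64,
    Float64,
}

/// Explicit cast requested by the user: always permitted here, but lossy,
/// because the fractional part is truncated toward zero.
pub fn cast_f64_to_i32(v: f64) -> i32 {
    v as i32
}

/// Coercion for a binary expression such as `lhs + rhs`: returns a common
/// type that both sides can be widened to without losing information,
/// or `None` if this toy rule set knows of no lossless choice.
pub fn coerce(lhs: NumType, rhs: NumType) -> Option<NumType> {
    use NumType::*;
    match (lhs, rhs) {
        (a, b) if a == b => Some(a),
        (Int32, Int64) | (Int64, Int32) => Some(Int64),
        (Int32, Float64) | (Float64, Int32) => Some(Float64),
        // Int64 -> Float64 can lose precision, so this sketch refuses it.
        _ => None,
    }
}
```

Keeping the two tables separate is what allows the cast rules to be permissive while the coercion rules stay conservative.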
andygrove
left a comment
Thanks @alamb. I had tried fixing up the casting vs coercion logic a couple of times in the past, and it's great to see this get cleaned up.
jorgecarleitao
left a comment
LGTM! Thanks a lot!
This is a PR incorporating the feedback from @nevi-me and @jorgecarleitao from #8400.
It adds:
1. A `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao / @nevi-me in #8400 (comment)) that encodes which type casts are valid
2. A test that ensures `can_cast_types` and `cast` remain in sync
3. Bug fixes that the test above uncovered (I'll comment inline)
4. A change to DataFusion to use `can_cast_types` so that it plans casting consistently with what Arrow allows
Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion.
I am adding this functionality so DataFusion gains rudimentary support for `DictionaryArray`.
Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types`. I have some thoughts on how to unify them (see #8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.
Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2
Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
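The idea of a test that keeps a predicate and its implementation in sync can be sketched on a toy type system. This is not the test added in the PR; `Ty`, `ALL_TYPES`, `can_cast`, `cast`, and `mismatches` are hypothetical, and the set of allowed casts is invented for illustration:

```rust
/// Toy type universe (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Ty {
    Int32,
    Utf8,
    Boolean,
}

pub const ALL_TYPES: [Ty; 3] = [Ty::Int32, Ty::Utf8, Ty::Boolean];

/// Plan-time predicate: is a cast from `from` to `to` supported?
pub fn can_cast(from: Ty, to: Ty) -> bool {
    use Ty::*;
    matches!(
        (from, to),
        (Int32, Int32)
            | (Utf8, Utf8)
            | (Boolean, Boolean)
            | (Int32, Utf8)
            | (Utf8, Int32)
            | (Boolean, Int32)
    )
}

/// Run-time "cast": returns the resulting type, or an error if unsupported.
/// Its match arms duplicate the knowledge encoded in `can_cast`.
pub fn cast(from: Ty, to: Ty) -> Result<Ty, String> {
    use Ty::*;
    match (from, to) {
        (a, b) if a == b => Ok(b),
        (Int32, Utf8) | (Utf8, Int32) | (Boolean, Int32) => Ok(to),
        _ => Err(format!("cannot cast {:?} to {:?}", from, to)),
    }
}

/// Exhaustively compares the two functions over every (from, to) pair and
/// returns the pairs on which they disagree. A sync test asserts this is empty.
pub fn mismatches() -> Vec<(Ty, Ty)> {
    let mut out = Vec::new();
    for &from in ALL_TYPES.iter() {
        for &to in ALL_TYPES.iter() {
            if can_cast(from, to) != cast(from, to).is_ok() {
                out.push((from, to));
            }
        }
    }
    out
}
```

Because the check is exhaustive over all pairs, adding an arm to one function without the other shows up as a non-empty `mismatches()` result, which is the property that makes the duplication tolerable.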