Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Oct 14, 2020

This is a PR incorporating the feedback from @nevi-me and @jorgecarleitao from #8400

It adds

  1. a can_cast_types function to the Arrow cast kernel (as suggested by @jorgecarleitao / @nevi-me in ARROW-10236: [Rust][DataFusion] Unify type casting logic in DataFusion #8400 (comment)) that encodes the valid type casting
  2. A test that ensures can_cast_types and cast remain in sync
  3. Bug fixes that the test above uncovered (I'll comment inline)
  4. Change DataFuson to use can_cast_types so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion.

I am adding this functionality so DataFusion gains rudimentary support DictionaryArray.

Codewise, I am concerned about the duplication in logic between the match statements in cast and `can_cast_types. I have some thoughts on how to unify them (see #8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.

// temporal casts
(Int32, Date32(_)) => cast_array_data::<Date32Type>(array, to_type.clone()),
(Int32, Time32(_)) => cast_array_data::<Date32Type>(array, to_type.clone()),
(Int32, Time32(TimeUnit::Second)) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not possible to cast Int32 to a Time32(Microsecond) or Time32(Nanosecond)

let to_size = MILLISECONDS;
if from_size != to_size {

// Scale time_array by (to_size / from_size) using a
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this code, casting from a timestamp 32 -> Date64 would result in a divide by zero error (as from_size / to_size was 1 / 1000 == 0

}

/// Returns true if this type is numeric: (UInt*, Unit*, or Float*)
pub fn is_numeric(t: &DataType) -> bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested by @nevi-me on #8400 (comment)

@github-actions
Copy link

let schema = Schema::new(vec![Field::new("a", DataType::Utf8, false)]);
let result = cast(col("a"), &schema, DataType::Int32);
result.expect_err("Invalid CAST from Utf8 to Int32");
// Ensure a useful error happens at plan time if invalid casts are used
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that arrow can, in fact, cast from utf8 -> Int32

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, unifying the cast logic was good for this. I've wanted to add cast options, such as disallowing lossy casts.
If/when we get to that point, we'll have to think about what behaviour we want DataFusion to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If/when we get to that point, we'll have to think about what behaviour we want DataFusion to use.

I think DataFusion now makes the distinction between "casting" (aka if the user specifically requests to cast from one type to another) which can be lossy and "coercion" (aka when casts need to be added explicitly so that expressions can be evaluated (e.g. plus).

Coercion is designed not be lossy, but casting can be.

The downside is that Datafusion has to have another set of rules of what type coercions are allowed (e.g. I need to add #8463 to properly support DictionaryArray in DataFusion).

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb I had tried fixing up the casting vs coercion logic a couple of times in the past and it's great to see this get cleaned up.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot!

@nevi-me nevi-me closed this in 1d10f22 Oct 16, 2020
kszucs pushed a commit that referenced this pull request Oct 19, 2020
…ataFusion

This is a PR incorporating the feedback from @nevi-me  and @jorgecarleitao  from #8400

It adds
1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao  / @nevi-me  in #8400 (comment)) that encodes the valid type casting
2. A test that ensures `can_cast_types` and `cast` remain in sync
3. Bug fixes that the test above uncovered (I'll comment inline)
4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion.

I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`.

Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see #8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.

Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2

Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
@alamb alamb deleted the alamb/ARROW-10236-casting-rules-2 branch October 26, 2020 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants