Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Oct 8, 2020

This is a proposed approach - if people like this approach I will write proper tests (to ensure that the cast kernel support and this function remain in sync).

This PR brings DataFusion to parity with the type casting supported by arrow and allows DataFusion to plan all casts that are supported by the arrow cast kernel

Previously the notions of coercion and casting were somewhat conflated. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion.

I personally want this functionality so when I add support for DictionaryArray casts in Arrow (#8346) they can also be used in DataFusion.

Codewise, I am concerned about the duplication in logic between this and cast.rs. However, the test I have in mind will ensure they don't get out of sync

Questions for reviewers:

  1. Would this code / test be better to include in the arrow cast kernel (cast.rs) itself? To keep the "support" matrix code closer to the implementation?

FYI @jorgecarleitao

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea is to use a common can_cast_types function to detect valid casts at plan time, and make can_cast_types consistent with the arrow cast kernel

@github-actions
Copy link

github-actions bot commented Oct 8, 2020

@alamb alamb force-pushed the alamb/ARROW-10236-casting-rules branch from 3f05ca9 to 4ee37c3 Compare October 8, 2020 15:07
@alamb alamb changed the title ARROW-10236: Unify type casting logic in DataFusion ARROW-10236: [Rust][DataFusion] Unify type casting logic in DataFusion Oct 8, 2020
}

// Note this is meant to mirror the structure in arrow/src/compute/kernels/cast.rs
match (from_type, to_type) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that I have been thinking about: in DataFusion, because we use dynamically typed arrays, we often have to annotate which types are supported by each arrow kernel / operation.

Thus, we need to duplicate these matches to enumerate the valid types accepted by the kernels, so that we error during planning.

I wonder if these functions shouldn't be closer to the implementation (i.e. in the arrow::compute::kernels), or, alternatively, if we could find an idiom that would allow us to write these matches one time (again, in compute::kernel).

It seems to me that the pattern is:

for compute

match array.data_type() {
   ... => compute(array)
}

for datatypes:

match datatype {
    ... => return Some(return_datatype)
}

one idea would be to use

match datatype {
    ... => return Some((closure_that_computes, return_datatype))
}

so that both use-cases could be written in a single match (and reduce the risk of mis-typing / change in one place without a change in another place).

This comment is not specific to this PR, which I need to go through: I was just curious about your thoughts on this, since you have been touching in a couple of these recently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao -- I think the idea of using a single match in cast.rs is a good one. I'll try and whip up a proposed PR with that pattern for discussion (though probably not until tomorrow morning US Eastern time, or the weekend)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also going to suggest if we move can_cast_types to Arrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao and @nevi-me -- here is the the pattern I came up with:

pub fn cast(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef> {
    let from_type = array.data_type();
    let func = get_cast_function(from_type, to_type)?;
    func(array)
}

/// Returns true if the cast from `array.array` type is possible, false otherwise
pub fn can_cast(array: &ArrayRef, to_type: &DataType, ) -> bool {
    let from_type = array.data_type();
    get_cast_function(from_type, to_type).is_ok()
}


type CastFunction = Fn(&ArrayRef) -> Result<ArrayRef>;

/// Returns a function that can cast `array` to `to_type`.
///
/// All type checking for supported types should be done in
/// get_cast_func itself. Thus, if Ok(func) is returned, func(array)
/// should be able to succeed for arrays. In other words, Err(_)
/// should be returned by `get_cast_function`, NOT `func(array)` if
/// the types are incompatible.
///
/// can_cast relies on this functionality
fn get_cast_function(from_type: &DataType, to_type: &DataType) -> Result<CastFunction> {
    use DataType::*;

    // clone array if types are the same
    if from_type == to_type {
        return Ok(|array| { Ok(array.clone()) })
    }

    match (from_type, to_type) {
        (Struct(_), _) => Err(ArrowError::ComputeError(
            "Cannot cast from struct to other types".to_string(),
        )),
        (_, Struct(_)) => Err(ArrowError::ComputeError(
            "Cannot cast to struct from other types".to_string(),
        )),
        (List(_), List(ref to)) => Ok(|array| {
            let data = array.data_ref();
            let underlying_array = make_array(data.child_data()[0].clone());
            let cast_array = cast(&underlying_array, &to)?;
            let array_data = ArrayData::new(
                *to.clone(),
                array.len(),
                Some(cast_array.null_count()),
                cast_array
                    .data()
                    .null_bitmap()
                    .clone()
                    .map(|bitmap| bitmap.bits),
                array.offset(),
                // reuse offset buffer
                data.buffers().to_vec(),
                vec![cast_array.data()],
            );
            let list = ListArray::from(Arc::new(array_data));
            Ok(Arc::new(list) as ArrayRef)
        }),
        (List(_), _) => Err(ArrowError::ComputeError(
            "Cannot cast list to non-list data types".to_string(),
        )),
        (_, List(ref to)) => {
            // cast primitive to list's primitive
            let cast_func = get_cast_function(from_type, &to)?;
            Ok(move |array| {
                let cast_array = cast_func(array, &to)?;
                // create offsets, where if array.len() = 2, we have [0,1,2]
                let offsets: Vec<i32> = (0..=array.len() as i32).collect();
                let value_offsets = Buffer::from(offsets[..].to_byte_slice());
                let list_data = ArrayData::new(
                    *to.clone(),
                    array.len(),
                    Some(cast_array.null_count()),
                    cast_array
                        .data()
                        .null_bitmap()
                        .clone()
                        .map(|bitmap| bitmap.bits),
                    0,
                    vec![value_offsets],
                    vec![cast_array.data()],
                );
                let list_array = Arc::new(ListArray::from(Arc::new(list_data))) as ArrayRef;

                Ok(list_array)
            })
        }
...

While I think it would work, I am not sure I will have enough time.

Instead, I will get a PR up that moves the can_cast_types (with duplicate match) to arrow and then try a separate refactoring PR to unify the two match arms

}
}

fn is_numeric_type(t: &DataType) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also something that can live in perhaps arrow::util or arrow::datatype (and be implemented as a trait member of DataType).
I needed to check if a datatype was listy or stringy, and the boilerplate was tedious.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is ready as there is a section missing on the tests, but so far this looks really good to me!

I have no specific comments apart from the idea of bringing that code to arrow itself. I would also be fine with doing it in another PR.

@alamb
Copy link
Contributor Author

alamb commented Oct 12, 2020

Thanks for the nudge @jorgecarleitao -- my plan is to move the code into arrow itself (and I was going to try some ideas to avoid the duplicate case statement). I haven't had as much time to work on this as I had hoped recently.

@alamb
Copy link
Contributor Author

alamb commented Oct 13, 2020

To keep the PR list clean, I will close this one and open a new PR with the alternate proposed implementation

@alamb
Copy link
Contributor Author

alamb commented Oct 14, 2020

#8460 is the follow on

nevi-me pushed a commit that referenced this pull request Oct 16, 2020
…ataFusion

This is a PR incorporating the feedback from @nevi-me  and @jorgecarleitao  from #8400

It adds
1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao  / @nevi-me  in #8400 (comment)) that encodes the valid type casting
2. A test that ensures `can_cast_types` and `cast` remain in sync
3. Bug fixes that the test above uncovered (I'll comment inline)
4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion.

I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`.

Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see #8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.

Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2

Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
kszucs pushed a commit that referenced this pull request Oct 19, 2020
…ataFusion

This is a PR incorporating the feedback from @nevi-me  and @jorgecarleitao  from #8400

It adds
1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao  / @nevi-me  in #8400 (comment)) that encodes the valid type casting
2. A test that ensures `can_cast_types` and `cast` remain in sync
3. Bug fixes that the test above uncovered (I'll comment inline)
4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion.

I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`.

Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see #8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.

Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2

Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
@alamb alamb deleted the alamb/ARROW-10236-casting-rules branch October 26, 2020 20:04
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…ataFusion

This is a PR incorporating the feedback from @nevi-me  and @jorgecarleitao  from apache/arrow#8400

It adds
1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao  / @nevi-me  in apache/arrow#8400 (comment)) that encodes the valid type casting
2. A test that ensures `can_cast_types` and `cast` remain in sync
3. Bug fixes that the test above uncovered (I'll comment inline)
4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in apache/arrow#8399 and this PR. See also apache/arrow#8340 (comment) for more discussion.

I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`.

Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see apache/arrow#8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.

Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2

Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants