Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Oct 5, 2020

This PR removes the explicit plan-time cast support check from DataFusion and instead uses the runtime check of the arrow compute cast kernel.

The upside of this change is less code overall and increased casting support (DataFusion now supports all casting supported by Arrow rather than just the subset that was hard coded in can_coerce_from).

A potential downside is that an unsupported cast will produce an error later in the process (after the plan has started running rather than during plan time).

This PR is part of a set of changes to support DictionaryArrays in DataFusion. Once I add the appropriate support to the cast kernel, DataFusion will be able to cast DictionaryArray types correctly with no additional code.

Prior to this change, the test in sql failed with this error message:

---- query_on_string_dictionary stdout ----
thread 'query_on_string_dictionary' panicked at 'called `Result::unwrap()` on an `Err` value: General("\'Dictionary(Int32, Utf8) = Utf8\' can\'t be evaluated because there isn\'t a common type to coerce the types to")', datafusion/tests/sql.rs:925:16
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@github-actions
Copy link

github-actions bot commented Oct 5, 2020

/// `result[row][column]`
async fn execute(ctx: &mut ExecutionContext, sql: &str) -> Vec<Vec<String>> {
let plan = ctx.create_logical_plan(&sql).unwrap();
let msg = format!("Creating logical plan for '{}'", sql);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are some changes to improve the test error reporting (rather than a straight up panic, some diagnostic information is printed as well)

}

#[tokio::test]
async fn query_on_string_dictionary() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the end to end testcast I am working on for DictionaryArray support -- with this PR we can do basic filtering in DataFusion. There are a few more PRs needed to complete expressions and aggregation, but they are coming.


// The following queries are not yet supported

// // filtering with constant
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have PRs in the works to support these cases and I will uncomment them as I do so.

@alamb alamb force-pushed the alamb/ARROW-10165-coercion-support branch from 23e2673 to 1885319 Compare October 6, 2020 16:38
@alamb alamb marked this pull request as ready for review October 6, 2020 16:40
/// # Errors
///
/// This function errors when it is impossible to cast the expression to the target [arrow::datatypes::DataType].
/// Currently no errors happen at plan time. If it is impossible
Copy link
Member

@jorgecarleitao jorgecarleitao Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure on whether this is advisable:

Doesn't this mean that the plan can fail arbitrarily when a user performs an impossible cast? This can happen like 10hs after the execution starts, when the query finally reaches that point.

I though that the point of having the types available during planning, both logical and physical, was to perform exactly these types of checks. Removing this check seems a regression to me, on which an impossible cast will only be caught during execution.

Isn't it possible to expand can_coerce_from to cover the cases that DataFusion is missing but that arrow cast kernel supports?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is interesting that there appear to be no uses (or tests) for this function -- https://github.com/apache/arrow/search?q=cast_to I found this quite confusing

@jorgecarleitao what do you think about simply removing the whole thing entirely?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry, I was confused: can_coerce_from checks whether we can perform a lossless cast. This is different from the ability to perform any cast.

Therefore, I think that the logic we may want is:

  • the Expr::Cast operation should check that we can perform the cast at all (e.g. a ListArray to a UInt32Array should not be allowed). This should match the casts that the kernel supports.
  • implicit casts, such as the ones that we perform when we pass arguments to functions to try to match the signature, should use can_coerce_from, as we do not want to perform lossless implicit casts.

AFAIK casts to and from dictionaries are lossless, and thus they can happen at any point in the physical planning. Since all arrays support dictionaries, I guess we only have to error if the implementation is still not available.

What do you think, @alamb ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao -- I think this makes sense. I was confused about the intent of can_coerce_from (aka that it is for lossless conversions only). With this explanation let me go back and give adding dictionary coercion / cast a second pass (though likely not until tomorrow, Thursday, US Eastern time)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao -- here is a proposed alternate approach: #8400

@alamb
Copy link
Contributor Author

alamb commented Oct 7, 2020

@jorgecarleitao

Doesn't this mean that the plan can fail arbitrarily when a user performs an impossible cast? This can happen like 10hs after the execution starts, when the query finally reaches that point.

Yes I think that is accurate.

Isn't it possible to expand can_coerce_from to cover the cases that DataFusion is missing but that arrow cast kernel supports?

Yes I can do that -- my rationale for this proposal was to avoiding having to keep two lists of valid typecasts in sync. However, the value of faster errors is a valid point. I will make a new PR to do as you suggest and expand the coverage in can_coerce_from

@alamb alamb closed this Oct 7, 2020
let this_type = self.get_type(schema)?;
if this_type == *cast_to_type {
Ok(self.clone())
} else if can_coerce_from(cast_to_type, &this_type) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR itself is pretty simple -- it just deletes code :)

@alamb
Copy link
Contributor Author

alamb commented Oct 7, 2020

Alternate approach is in #8359

@alamb alamb deleted the alamb/ARROW-10165-coercion-support branch October 8, 2020 14:06
jorgecarleitao pushed a commit that referenced this pull request Oct 8, 2020
The code / comments for type coercion are a little confusing and don't make the distinction between coercion and casting clear – this PR attempts to clarify the intent, channeling the information from @jorgecarleitao  here: #8340 (comment)

Closes #8399 from alamb/alamb/coercion-docs

Authored-by: alamb <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
nevi-me pushed a commit that referenced this pull request Oct 16, 2020
…ataFusion

This is a PR incorporating the feedback from @nevi-me  and @jorgecarleitao  from #8400

It adds
1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao  / @nevi-me  in #8400 (comment)) that encodes the valid type casting
2. A test that ensures `can_cast_types` and `cast` remain in sync
3. Bug fixes that the test above uncovered (I'll comment inline)
4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion.

I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`.

Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see #8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.

Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2

Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
kszucs pushed a commit that referenced this pull request Oct 19, 2020
…ataFusion

This is a PR incorporating the feedback from @nevi-me  and @jorgecarleitao  from #8400

It adds
1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao  / @nevi-me  in #8400 (comment)) that encodes the valid type casting
2. A test that ensures `can_cast_types` and `cast` remain in sync
3. Bug fixes that the test above uncovered (I'll comment inline)
4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion.

I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`.

Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see #8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.

Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2

Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…ataFusion

This is a PR incorporating the feedback from @nevi-me  and @jorgecarleitao  from apache/arrow#8400

It adds
1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao  / @nevi-me  in apache/arrow#8400 (comment)) that encodes the valid type casting
2. A test that ensures `can_cast_types` and `cast` remain in sync
3. Bug fixes that the test above uncovered (I'll comment inline)
4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows

Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in apache/arrow#8399 and this PR. See also apache/arrow#8340 (comment) for more discussion.

I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`.

Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see apache/arrow#8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync.

Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2

Authored-by: alamb <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants