-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel #8340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| /// `result[row][column]` | ||
| async fn execute(ctx: &mut ExecutionContext, sql: &str) -> Vec<Vec<String>> { | ||
| let plan = ctx.create_logical_plan(&sql).unwrap(); | ||
| let msg = format!("Creating logical plan for '{}'", sql); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are some changes to improve the test error reporting (rather than a straight up panic, some diagnostic information is printed as well)
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn query_on_string_dictionary() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the end to end testcast I am working on for DictionaryArray support -- with this PR we can do basic filtering in DataFusion. There are a few more PRs needed to complete expressions and aggregation, but they are coming.
|
|
||
| // The following queries are not yet supported | ||
|
|
||
| // // filtering with constant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have PRs in the works to support these cases and I will uncomment them as I do so.
…ng checks in favor of Arrow cast kernel
23e2673 to
1885319
Compare
| /// # Errors | ||
| /// | ||
| /// This function errors when it is impossible to cast the expression to the target [arrow::datatypes::DataType]. | ||
| /// Currently no errors happen at plan time. If it is impossible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure on whether this is advisable:
Doesn't this mean that the plan can fail arbitrarily when a user performs an impossible cast? This can happen like 10hs after the execution starts, when the query finally reaches that point.
I though that the point of having the types available during planning, both logical and physical, was to perform exactly these types of checks. Removing this check seems a regression to me, on which an impossible cast will only be caught during execution.
Isn't it possible to expand can_coerce_from to cover the cases that DataFusion is missing but that arrow cast kernel supports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is interesting that there appear to be no uses (or tests) for this function -- https://github.com/apache/arrow/search?q=cast_to I found this quite confusing
@jorgecarleitao what do you think about simply removing the whole thing entirely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry, I was confused: can_coerce_from checks whether we can perform a lossless cast. This is different from the ability to perform any cast.
Therefore, I think that the logic we may want is:
- the
Expr::Castoperation should check that we can perform the cast at all (e.g. aListArrayto aUInt32Arrayshould not be allowed). This should match the casts that the kernel supports. - implicit casts, such as the ones that we perform when we pass arguments to functions to try to match the signature, should use
can_coerce_from, as we do not want to perform lossless implicit casts.
AFAIK casts to and from dictionaries are lossless, and thus they can happen at any point in the physical planning. Since all arrays support dictionaries, I guess we only have to error if the implementation is still not available.
What do you think, @alamb ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorgecarleitao -- I think this makes sense. I was confused about the intent of can_coerce_from (aka that it is for lossless conversions only). With this explanation let me go back and give adding dictionary coercion / cast a second pass (though likely not until tomorrow, Thursday, US Eastern time)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorgecarleitao -- here is a proposed alternate approach: #8400
Yes I think that is accurate.
Yes I can do that -- my rationale for this proposal was to avoiding having to keep two lists of valid typecasts in sync. However, the value of faster errors is a valid point. I will make a new PR to do as you suggest and expand the coverage in |
| let this_type = self.get_type(schema)?; | ||
| if this_type == *cast_to_type { | ||
| Ok(self.clone()) | ||
| } else if can_coerce_from(cast_to_type, &this_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR itself is pretty simple -- it just deletes code :)
|
Alternate approach is in #8359 |
The code / comments for type coercion are a little confusing and don't make the distinction between coercion and casting clear – this PR attempts to clarify the intent, channeling the information from @jorgecarleitao here: #8340 (comment) Closes #8399 from alamb/alamb/coercion-docs Authored-by: alamb <[email protected]> Signed-off-by: Jorge C. Leitao <[email protected]>
…ataFusion This is a PR incorporating the feedback from @nevi-me and @jorgecarleitao from #8400 It adds 1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao / @nevi-me in #8400 (comment)) that encodes the valid type casting 2. A test that ensures `can_cast_types` and `cast` remain in sync 3. Bug fixes that the test above uncovered (I'll comment inline) 4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion. I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`. Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see #8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync. Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2 Authored-by: alamb <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
…ataFusion This is a PR incorporating the feedback from @nevi-me and @jorgecarleitao from #8400 It adds 1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao / @nevi-me in #8400 (comment)) that encodes the valid type casting 2. A test that ensures `can_cast_types` and `cast` remain in sync 3. Bug fixes that the test above uncovered (I'll comment inline) 4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in #8399 and this PR. See also #8340 (comment) for more discussion. I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`. Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see #8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync. Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2 Authored-by: alamb <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
…ataFusion This is a PR incorporating the feedback from @nevi-me and @jorgecarleitao from apache/arrow#8400 It adds 1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao / @nevi-me in apache/arrow#8400 (comment)) that encodes the valid type casting 2. A test that ensures `can_cast_types` and `cast` remain in sync 3. Bug fixes that the test above uncovered (I'll comment inline) 4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in apache/arrow#8399 and this PR. See also apache/arrow#8340 (comment) for more discussion. I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`. Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see apache/arrow#8400 (comment)), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync. Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2 Authored-by: alamb <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
This PR removes the explicit plan-time cast support check from DataFusion and instead uses the runtime check of the arrow compute cast kernel.
The upside of this change is less code overall and increased casting support (DataFusion now supports all casting supported by Arrow rather than just the subset that was hard coded in
can_coerce_from).A potential downside is that an unsupported cast will produce an error later in the process (after the plan has started running rather than during plan time).
This PR is part of a set of changes to support
DictionaryArrays in DataFusion. Once I add the appropriate support to thecastkernel, DataFusion will be able to castDictionaryArraytypes correctly with no additional code.Prior to this change, the test in sql failed with this error message: