Refactor range/gen_series signature away from user defined#18317
Refactor range/gen_series signature away from user defined#18317alamb merged 4 commits intoapache:mainfrom
range/gen_series signature away from user defined#18317Conversation
| TypeSignatureClass::Binary if native_type.is_binary() => { | ||
| Ok(origin_type.to_owned()) | ||
| } | ||
| _ if native_type.is_null() => Ok(origin_type.to_owned()), |
There was a problem hiding this comment.
We were missing this even though we check for it in matches_native_type; I think this will allow passing null types to coercible signatures without needing to explicitly specify null type handling
| } | ||
|
|
||
| impl Range { | ||
| fn defined_signature() -> Signature { |
There was a problem hiding this comment.
This is the main change
| // Signature can only guarantee we get a date type, not specifically | ||
| // date32 so handle potential cast from date64 here. | ||
| let start = cast(start, &DataType::Date32)?; | ||
| let start = as_date32_array(&start)?; | ||
| let stop = cast(stop, &DataType::Date32)?; | ||
| let stop = as_date32_array(&stop)?; |
There was a problem hiding this comment.
I don't see much benefit in handling date64 natively anyway (given it's a bit of a weird type) so I think this is fine
| // Signature can only guarantee we get a timestamp type, not specifically | ||
| // timestamp(ns) so handle potential cast from other timestamps here. | ||
| fn cast_to_ns(arr: &ArrayRef) -> Result<ArrayRef> { | ||
| match arr.data_type() { | ||
| DataType::Timestamp(TimeUnit::Nanosecond, _) => Ok(Arc::clone(arr)), | ||
| DataType::Timestamp(_, tz) => Ok(cast( | ||
| arr, | ||
| &DataType::Timestamp(TimeUnit::Nanosecond, tz.to_owned()), | ||
| )?), | ||
| _ => unreachable!(), | ||
| } | ||
| } | ||
| let start = cast_to_ns(start)?; | ||
| let start = as_timestamp_nanosecond_array(&start)?; | ||
| let stop = cast_to_ns(stop)?; | ||
| let stop = as_timestamp_nanosecond_array(&stop)?; |
There was a problem hiding this comment.
I wonder if is better to instead try uplift the signature coercible API to allow specifying we only want timestamp(ns) (of any timezone) and let that handle coercion for us 🤔
There was a problem hiding this comment.
I agree that would be nicer but I think it could also be done as a follow on PR
| range(arrow_cast(1, 'Int8'), 5, -1), | ||
| range(arrow_cast(1, 'Int16'), arrow_cast(-5, 'Int8'), 1), | ||
| range(arrow_cast(1, 'Int32'), arrow_cast(-5, 'Int16'), arrow_cast(-1, 'Int8')), | ||
| range(DATE '1992-09-01', DATE '1993-03-01', arrow_cast('1 MONTH', 'Interval(YearMonth)')), | ||
| range(DATE '1993-02-01', arrow_cast(DATE '1993-01-01', 'Date64'), INTERVAL '-1' DAY), | ||
| range(arrow_cast(DATE '1989-04-01', 'Date64'), DATE '1993-03-01', INTERVAL '1' YEAR), | ||
| range(arrow_cast(DATE '1993-03-01', 'Date64'), arrow_cast(DATE '1989-04-01', 'Date64'), INTERVAL '1' YEAR) |
There was a problem hiding this comment.
Uplift tests to ensure coercion is handled properly for different types
There was a problem hiding this comment.
I recommend making this more explicit by:
- Revert the change to the existing tests
- Add a new query right below this one that adds the casts to the other integer types
| query error DataFusion error: Internal error: Unexpected argument type for generate_series : Date32 | ||
| query error DataFusion error: Error during planning: Internal error: Function 'generate_series' failed to match any signature | ||
| select generate_series(arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))'), DATE '2021-01-02', INTERVAL '1' HOUR); | ||
|
|
||
| ## mixing types not allowed even if an argument is null | ||
| query error DataFusion error: Error during planning: Internal error: Function 'generate_series' failed to match any signature | ||
| select generate_series(TIMESTAMP '1992-09-01', DATE '1993-03-01', NULL); | ||
|
|
||
| query error DataFusion error: Error during planning: Internal error: Function 'generate_series' failed to match any signature | ||
| select generate_series(1, '2024-01-01', '2025-01-02'); | ||
|
|
||
| query error DataFusion error: Error during planning: Internal error: Function 'generate_series' failed to match any signature | ||
| select generate_series('2024-01-01'::timestamp, '2025-01-02', interval '1 day'); |
There was a problem hiding this comment.
So these are the original examples from the issue; technically we still return an internal error, so it feels a bit wrong to claim we close #15881 (well its an internal error nested in a plan error apparently)
However this internal error comes from the function signature related code, so we can fix it there and it should propagate to here (aka its not a specific issue with generate_series)
There was a problem hiding this comment.
Yes, I agree we should fix it in the signature related code
I think it would also be ideal if we can provide more details on what went wrong and how we can fix it (aka as a user who doesn't understand the internal implementation, if I get this error message, how do I fix it?)
| }; | ||
| } | ||
|
|
||
| macro_rules! singleton_variant { |
There was a problem hiding this comment.
Could you potentially add some documentation here about what is different about this macro compared to the others in this module?
|
|
||
| // Signature can only guarantee we get a timestamp type, not specifically | ||
| // timestamp(ns) so handle potential cast from other timestamps here. | ||
| fn cast_to_ns(arr: &ArrayRef) -> Result<ArrayRef> { |
There was a problem hiding this comment.
given cast already checks for the same type, I wonder if this is necessary? Or is the idea to cast to nanoseconds and maintain the timezone?
There was a problem hiding this comment.
Yeah it's to cast from potentially other timestamp time units down to nanoseconds
| // Signature can only guarantee we get a timestamp type, not specifically | ||
| // timestamp(ns) so handle potential cast from other timestamps here. | ||
| fn cast_to_ns(arr: &ArrayRef) -> Result<ArrayRef> { | ||
| match arr.data_type() { | ||
| DataType::Timestamp(TimeUnit::Nanosecond, _) => Ok(Arc::clone(arr)), | ||
| DataType::Timestamp(_, tz) => Ok(cast( | ||
| arr, | ||
| &DataType::Timestamp(TimeUnit::Nanosecond, tz.to_owned()), | ||
| )?), | ||
| _ => unreachable!(), | ||
| } | ||
| } | ||
| let start = cast_to_ns(start)?; | ||
| let start = as_timestamp_nanosecond_array(&start)?; | ||
| let stop = cast_to_ns(stop)?; | ||
| let stop = as_timestamp_nanosecond_array(&stop)?; |
There was a problem hiding this comment.
I agree that would be nicer but I think it could also be done as a follow on PR
| range(arrow_cast(1, 'Int8'), 5, -1), | ||
| range(arrow_cast(1, 'Int16'), arrow_cast(-5, 'Int8'), 1), | ||
| range(arrow_cast(1, 'Int32'), arrow_cast(-5, 'Int16'), arrow_cast(-1, 'Int8')), | ||
| range(DATE '1992-09-01', DATE '1993-03-01', arrow_cast('1 MONTH', 'Interval(YearMonth)')), | ||
| range(DATE '1993-02-01', arrow_cast(DATE '1993-01-01', 'Date64'), INTERVAL '-1' DAY), | ||
| range(arrow_cast(DATE '1989-04-01', 'Date64'), DATE '1993-03-01', INTERVAL '1' YEAR), | ||
| range(arrow_cast(DATE '1993-03-01', 'Date64'), arrow_cast(DATE '1989-04-01', 'Date64'), INTERVAL '1' YEAR) |
There was a problem hiding this comment.
I recommend making this more explicit by:
- Revert the change to the existing tests
- Add a new query right below this one that adds the casts to the other integer types
| query error DataFusion error: Internal error: Unexpected argument type for generate_series : Date32 | ||
| query error DataFusion error: Error during planning: Internal error: Function 'generate_series' failed to match any signature | ||
| select generate_series(arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))'), DATE '2021-01-02', INTERVAL '1' HOUR); | ||
|
|
||
| ## mixing types not allowed even if an argument is null | ||
| query error DataFusion error: Error during planning: Internal error: Function 'generate_series' failed to match any signature | ||
| select generate_series(TIMESTAMP '1992-09-01', DATE '1993-03-01', NULL); | ||
|
|
||
| query error DataFusion error: Error during planning: Internal error: Function 'generate_series' failed to match any signature | ||
| select generate_series(1, '2024-01-01', '2025-01-02'); | ||
|
|
||
| query error DataFusion error: Error during planning: Internal error: Function 'generate_series' failed to match any signature | ||
| select generate_series('2024-01-01'::timestamp, '2025-01-02', interval '1 day'); |
There was a problem hiding this comment.
Yes, I agree we should fix it in the signature related code
I think it would also be ideal if we can provide more details on what went wrong and how we can fix it (aka as a user who doesn't understand the internal implementation, if I get this error message, how do I fix it?)
…e#18317) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#15881 - See my notes below ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Trying to move away from user defined signatures where possible; mainly to ensure consistency of error checking/messages. The original issue is because the function has to do this checking itself leading to inconsistency of error used (ideally shouldn't be internal). By uplifting away from a user defined signature we can make use of existing code meant to handle this checking and error messages for us. ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> Defined range/generate_series signature via coercible API instead of being user defined. Some accompanying changes are needed in the signature code to make this possible. ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Added SLT tests and fixed any existing ones. ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> No (error messages do change though) <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
…e#18317) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#15881 - See my notes below ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Trying to move away from user defined signatures where possible; mainly to ensure consistency of error checking/messages. The original issue is because the function has to do this checking itself leading to inconsistency of error used (ideally shouldn't be internal). By uplifting away from a user defined signature we can make use of existing code meant to handle this checking and error messages for us. ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> Defined range/generate_series signature via coercible API instead of being user defined. Some accompanying changes are needed in the signature code to make this possible. ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Added SLT tests and fixed any existing ones. ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> No (error messages do change though) <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
Which issue does this PR close?
generate_series: Internal error: could not cast array of type Date32 #15881Rationale for this change
Trying to move away from user defined signatures where possible; mainly to ensure consistency of error checking/messages. The original issue is because the function has to do this checking itself leading to inconsistency of error used (ideally shouldn't be internal). By uplifting away from a user defined signature we can make use of existing code meant to handle this checking and error messages for us.
What changes are included in this PR?
Defined range/generate_series signature via coercible API instead of being user defined. Some accompanying changes are needed in the signature code to make this possible.
Are these changes tested?
Added SLT tests and fixed any existing ones.
Are there any user-facing changes?
No (error messages do change though)