Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with "to_date" failing to process dates later than year 2262 #12227

Merged
merged 11 commits into from
Sep 6, 2024

Conversation

MartinKolbAtWork
Copy link
Contributor

Which issue does this PR close?

Closes #12226

Rationale for this change

The "to_date" function (https://github.com/apache/datafusion/blob/main/datafusion/functions/src/datetime/to_date.rs) fails to process dates that are later than year 2262.
This is caused by the implementation detail that the conversion process uses nano-seconds based on epoch.
The Arrow datatype for Date32 and Date64 support much larger values. The statements in some areas in the code state that the usage of nanoseconds is imposed by the Date types of Arrow. This is simply wrong. The Date32 type stores the number of days since epoch. The Date64 type stores the milliseconds (NOT nanoseconds) since epoch. Both Date32 and Date64 can therefore massively exceed the year 2262.
See: https://arrow.apache.org/docs/cpp/api/datatype.html

NOTE:
Processing dates later than 2262 is not a theoretical issue. In widely used business software systems, unbounded dates (e.g. an "expiry_date" that is set to never expire) are set to the 31st of December of the year 9999 (i.e. "9999-12-31").
Processing such data with datafusion will fail if the current "to_date" implementation touches this data.

What changes are included in this PR?

  • Changing the implementation of "to_date" to use milli-seconds instead of nano-seconds in order to achieve compliance with the values supported on Arrow's Date32 and Date64
  • Unit tests to validate the correct behavior

Are these changes tested?

Unit tests are provided in https://github.com/apache/datafusion/blob/main/datafusion/functions/src/datetime/to_date.rs

Are there any user-facing changes?

n/a

Due to using nanoseconds as an intermediate state, the values for dates processed via "to_date" cannot be later than the year 2262. The Arrow datatype for Date32 and Date64 supports much larger values.

The statements in some areas in the code that the usage of nanoseconds is imposed by Arrow is simply wrong. The Date32 type stores the number of days since epoch. The Date64 type stores the milliseconds (NOT nanoseconds) since epoch. Both Date32 and Date64 can therefore massively exceed the year 2262.
See: https://arrow.apache.org/docs/cpp/api/datatype.html
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MartinKolbAtWork for your contribution

I just checked we do have an issue with to_date function whereas the cast works correctly

> select extract(day from cast('9999-12-31' as date));
+-------------------------------------------+
| date_part(Utf8("DAY"),Utf8("9999-12-31")) |
+-------------------------------------------+
| 31.0                                      |
+-------------------------------------------+
1 row(s) fetched. 
Elapsed 0.005 seconds.

> select to_date('9999-12-31');
Arrow error: Parser error: The dates that can be represented as nanoseconds have to be between 1677-09-21T00:12:44.0 and 2262-04-11T23:47:16.854775804

Perhaps we can use a the same implementation like CAST does(reuse it) instead of creating a new impl

@MartinKolbAtWork
Copy link
Contributor Author

MartinKolbAtWork commented Aug 30, 2024

Thanks @MartinKolbAtWork for your contribution

I just checked we do have an issue with to_date function whereas the cast works correctly

> select extract(day from cast('9999-12-31' as date));
+-------------------------------------------+
| date_part(Utf8("DAY"),Utf8("9999-12-31")) |
+-------------------------------------------+
| 31.0                                      |
+-------------------------------------------+
1 row(s) fetched. 
Elapsed 0.005 seconds.

> select to_date('9999-12-31');
Arrow error: Parser error: The dates that can be represented as nanoseconds have to be between 1677-09-21T00:12:44.0 and 2262-04-11T23:47:16.854775804

Perhaps we can use a the same implementation like CAST does(reuse it) instead of creating a new impl

The main thing of "to_date" is that is supports supplying a format string. As CAST does not support this, I don't think merging or re-using these two implementations is an approach that should be pursued. However with the latest commit I reduced the implementation effort by calling the parsing implementations from "arrow_cast".

@findepi
Copy link
Member

findepi commented Aug 30, 2024

@jayzhan211 can you please trigger the build?

@findepi
Copy link
Member

findepi commented Aug 30, 2024

@MartinKolbAtWork please check the build results.
for example

[Diff] (-expected|+actual)
-   2023-01-13
+   2023-01-14
at test_files/dates.slt:126

it might be the expected value was wrong in this case

@findepi
Copy link
Member

findepi commented Aug 30, 2024

Can you please update the docs?

Supported range for string input is between `1677-09-21` and `2262-04-11` exclusive. To parse dates outside of

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 30, 2024
@MartinKolbAtWork
Copy link
Contributor Author

Hi @findepi I fixed the test failures and updated the docs. You the build be re-triggered to check if this was successful?

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the update! couple minor comments

datafusion/functions/src/datetime/to_date.rs Outdated Show resolved Hide resolved
datafusion/functions/src/datetime/to_date.rs Outdated Show resolved Hide resolved
datafusion/functions/src/datetime/to_date.rs Show resolved Hide resolved
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 31, 2024
@MartinKolbAtWork
Copy link
Contributor Author

Hi @findepi ,
thanks again for assisting me to get the PR in a “mergeable” state. Really appreciate your input. Your comment on the change regarding the timestamp behavior made me think that we might take one step back.
These changes related to the timestamp handling were added to support negative years. The initial target for this PR is fixing the problem of handling years greater than 2262. I found your suggestion about supporting negative years very compelling, because it semantically addresses a similar edge case.

Now that we found out that adding support for negative years does add several nitty-gritty details, that need to be carefully thought through, I suggest that we split this up and handle the negative years topic in a separate PR.
I reverted the changes back to the point where it’s only about supporting years up to 9999. This makes the code simpler and easier to understand.
Hope this makes sense for you.

@alamb
Copy link
Contributor

alamb commented Sep 1, 2024

fyi @Omega359

@Omega359
Copy link
Contributor

Omega359 commented Sep 1, 2024

I wonder what the cost is for using string_to_datetime_formatted vs a more limited string_to_date_formatted. There should be slightly less parsing involved for that.

@MartinKolbAtWork
Copy link
Contributor Author

I wonder what the cost is for using string_to_datetime_formatted vs a more limited string_to_date_formatted. There should be slightly less parsing involved for that.

Hi @Omega359 ,
thanks for asking. This is a good question. Let me throw in two aspects.

  1. The current implementation also uses a “complete” parsing, so the proposed change does not make anything worse. The change just uses milliseconds instead of nanoseconds as an intermediate value to avoid the limitation of the upper bound of year 2262 when using nanoseconds.

  2. Simply parsing the values for year, month, and day out of a timestamp will not always create correct value for a date. This is an example of an existing test case in DataFusion:

    SELECT to_date('01-14-2023 01:01:30+05:30', '%q', '%d-%m-%Y %H/%M/%S', '%+', '%m-%d-%Y %H:%M:%S%#z');
    ----
    2023-01-13

    The correct day of the timestamp 01-14-2023 01:01:30+05:30 is 13, although the “parsed day” is 14. This is because the timestamp refers to a time-zone where the local day is already the 14th, but the UTC day is the 13th.
    When parsing to a “timestamp”, it is expected that the “day” is “14” because the timestamp object can be asked for its time-zone, so the UTC date (13th) can be calculated from that available information. A simple “date” object however does not carry time-zone information. Therefore the date must refer to UTC, which is the 13th.

And, as a brief reminder… I just intended to fix the issue that the date calculation has an upper limit of 2262. Both, the full timestamp parsing as well as the UTC handling are already in the code-base. I introduced neither of them.

@Omega359
Copy link
Contributor

Omega359 commented Sep 2, 2024

And, as a brief reminder… I just intended to fix the issue that the date calculation has an upper limit of 2262. Both, the full timestamp parsing as well as the UTC handling are already in the code-base. I introduced neither of them.

Good points. I am unsure if I introduced that or just inherited it when I added format parsing and didn't think closely enough about it. Glad to see this being resolved

@MartinKolbAtWork
Copy link
Contributor Author

Hi @Omega359 , @findepi , @alamb ,
thanks for your suggestions for far. I'm not sure if there anything is left that you want me to change, or do we consider the current state to be ready for getting an approving review?

@alamb
Copy link
Contributor

alamb commented Sep 5, 2024

thanks for your suggestions for far. I'm not sure if there anything is left that you want me to change, or do we consider the current state to be ready for getting an approving review?

I will check this PR out shortly

@MartinKolbAtWork
Copy link
Contributor Author

thanks for your suggestions for far. I'm not sure if there anything is left that you want me to change, or do we consider the current state to be ready for getting an approving review?

I will check this PR out shortly

@alamb , take your time.
I just asked to make sure that I did not miss any action that someone is waiting for.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really nice (and well tested to me). THank you very much for the contribution @MartinKolbAtWork

Also, thank you @findepi and @comphead for the review

cc @Omega359


use super::ToDateFunc;

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

Very nice

"2020-09-08 12:13:29",
];
for date_str in test_cases {
let formatted_date_scalar = ScalarValue::Utf8(Some(date_str.into()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I think you can write the same thing more concisely like

Suggested change
let formatted_date_scalar = ScalarValue::Utf8(Some(date_str.into()));
let formatted_date_scalar = ScalarValue::from(date_str);

Comment on lines +280 to +284
let to_date_result =
ToDateFunc::new().invoke(&[ColumnarValue::Scalar(formatted_date_scalar)]);

match to_date_result {
Ok(ColumnarValue::Scalar(ScalarValue::Date32(date_val))) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The machinery to call the function is also a but clunky, I wonder if it would make the tests easier to read (as there would be less conversion machinery) if we added a function like

/// Invokes ToDate on the input string, and returns the resulting Date32 value
fn run_to_date(input: &str) -> Result<i32> {
..
}

@alamb
Copy link
Contributor

alamb commented Sep 6, 2024

I believe the CI failure https://github.com/apache/datafusion/actions/runs/10725200169/job/29798407173?pr=12227 was fixed on main by @mbrobbel in #12346

I took the liberty of merging the branch up from main to try and get a clean CI run

@alamb
Copy link
Contributor

alamb commented Sep 6, 2024

We can consider potential changes to make the test more "readable" as part of a follow on PR.

Thanks agian @MartinKolbAtWork @Omega359 @findepi and @comphead

@alamb alamb merged commit 1a44ad2 into apache:main Sep 6, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"to_date" fails to process dates later than year 2262
5 participants