[SPARK-30668][SQL] Support SimpleDateFormat patterns in parsing timestamps/dates strings
#27441
Conversation
I will update the SQL migration guide soon.

jenkins, retest this, please
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
This patch LGTM as it is (pending migration guide updates). Another concern is: since we use the new formatter by default, shall we try the old formatter if the new formatter doesn't work? Then users wouldn't need to enable the legacy config in some cases.
Test build #117770 has finished for PR 27441 at commit
Hi, @gatorsmile . Since you are the issue reporter, could you confirm that you are okay with this solution?

This may have some downsides: some values in a column could be parsed by the old parser, which is based on the combined (Julian + Gregorian) calendar, while other values in the same column are parsed by the new one, which uses the Proleptic Gregorian calendar. This could make users' lives harder.

There are some problems with jenkins:
I am considering 2 options for updating the SQL migration guide:
The first option will clearly highlight the changes. @cloud-fan @dongjoon-hyun WDYT?
TODO: Need to update all docs like spark/python/pyspark/sql/readwriter.py (Line 224 in e5abbab)
A separate item sounds good, but don't forget to remove the existing item.

I don't think so. Legacy configs are internal and are not expected to be turned on in most cases. We don't need to mention the legacy config in user-facing documents.

At the moment, the config is not internal. I will make it internal. Also, I reviewed other legacy configs and made some of them internal: #27448
jenkins, retest this, please

1 similar comment

jenkins, retest this, please

Test build #117861 has finished for PR 27441 at commit

jenkins, retest this, please
- CSV/JSON datasources use the java.time API for parsing and generating CSV/JSON content. In Spark version 2.4 and earlier, java.text.SimpleDateFormat is used for the same purpose, with fallbacks to the parsing mechanisms of Spark 2.0 and 1.x. For example, `2018-12-08 10:39:21.123` with the pattern `yyyy-MM-dd'T'HH:mm:ss.SSS` cannot be parsed since Spark 3.0 because the timestamp does not match the pattern, but it can be parsed by earlier Spark versions due to a fallback to `Timestamp.valueOf`. To parse the same timestamp since Spark 3.0, the pattern should be `yyyy-MM-dd HH:mm:ss.SSS`.
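The behavior described in this migration note can be reproduced directly with the underlying Java APIs (a minimal sketch, independent of Spark): `DateTimeFormatter` rejects the string because the literal `'T'` in the pattern does not match the space in the input, while `Timestamp.valueOf` (the pre-3.0 fallback path) still accepts it.

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class FallbackDemo {
    public static void main(String[] args) {
        String input = "2018-12-08 10:39:21.123";
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS");
        try {
            // Fails: the literal 'T' in the pattern does not match the space in the input.
            LocalDateTime.parse(input, fmt);
            System.out.println("parsed by DateTimeFormatter");
        } catch (DateTimeParseException e) {
            System.out.println("DateTimeFormatter rejects the input");
        }
        // The legacy fallback accepts the same string without any pattern.
        Timestamp ts = Timestamp.valueOf(input);
        System.out.println("Timestamp.valueOf: " + ts);
    }
}
```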
- The `unix_timestamp`, `date_format`, `to_unix_timestamp`, `from_unixtime`, `to_date`, `to_timestamp` functions. The new implementation supports pattern formats as described at https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html and performs strict checking of its input. For example, the `2015-07-22 10:00:00` timestamp cannot be parsed if the pattern is `yyyy-MM-dd` because the parser does not consume the whole input. Another example: the `31/01/2015 00:00` input cannot be parsed by the `dd/MM/yyyy hh:mm` pattern because `hh` supposes hours in the range `1-12`.
- Parsing/formatting of timestamp/date strings. This affects CSV/JSON datasources and the `unix_timestamp`, `date_format`, `to_unix_timestamp`, `from_unixtime`, `to_date`, `to_timestamp` functions when patterns specified by users are used for parsing and formatting. Since Spark 3.0, the conversions are based on `java.time.format.DateTimeFormatter` (see https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html). The new implementation performs strict checking of its input. For example, the `2015-07-22 10:00:00` timestamp cannot be parsed if the pattern is `yyyy-MM-dd` because the parser does not consume the whole input. Another example: the `31/01/2015 00:00` input cannot be parsed by the `dd/MM/yyyy hh:mm` pattern because `hh` supposes hours in the range `1-12`. In Spark version 2.4 and earlier, `java.text.SimpleDateFormat` is used for timestamp/date string conversions, and the supported patterns are described at https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The old behavior can be restored by setting `spark.sql.legacy.timeParser.enabled` to `true`.
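The "does not consume the whole input" strictness can be demonstrated with plain Java (a sketch outside Spark): `SimpleDateFormat.parse` happily reads a date prefix and ignores trailing text, while `DateTimeFormatter`-based parsing rejects it.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class StrictParsingDemo {
    public static void main(String[] args) throws ParseException {
        String input = "2015-07-22 10:00:00";
        // Old behavior: SimpleDateFormat parses the leading date and silently ignores the rest.
        java.util.Date d = new SimpleDateFormat("yyyy-MM-dd").parse(input);
        System.out.println("SimpleDateFormat: parsed prefix, trailing text ignored");
        // New behavior: the whole input must be consumed by the pattern.
        try {
            LocalDate.parse(input, DateTimeFormatter.ofPattern("yyyy-MM-dd"));
            System.out.println("DateTimeFormatter: parsed");
        } catch (DateTimeParseException e) {
            System.out.println("DateTimeFormatter: unparsed trailing text rejected");
        }
    }
}
```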
is this really related to the Proleptic Gregorian calendar switch? It looks to me that we just switch to a better pattern string implementation.
Yes, it is related because SimpleDateFormat and DateTimeFormatter use different calendars underneath. Slightly different patterns are just a consequence of switching.
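The calendar difference can be made concrete with a small sketch (not from the PR): parsing the same ancient date string with `SimpleDateFormat` (hybrid Julian+Gregorian before 1582-10-15) and with `java.time` (Proleptic Gregorian throughout) yields instants several days apart.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.util.TimeZone;

public class CalendarDiffDemo {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-MM-dd");
        legacy.setTimeZone(TimeZone.getTimeZone("UTC"));
        // SimpleDateFormat interprets dates before 1582-10-15 in the Julian calendar.
        // Midnight UTC is an exact multiple of 86,400,000 ms, so this division is exact.
        long hybridDays = legacy.parse("1000-01-01").getTime() / 86_400_000L;
        // java.time applies the Proleptic Gregorian calendar to all dates.
        long prolepticDays = LocalDate.parse("1000-01-01").toEpochDay();
        System.out.println("difference in days: " + (hybridDays - prolepticDays));
        System.out.println("calendars differ: " + (hybridDays != prolepticDays));
    }
}
```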
LGTM except one question.

thanks, merging to master/3.0!
…estamps/dates strings

### What changes were proposed in this pull request?
In the PR, I propose to partially revert the commit 51a6ba0, and provide a legacy parser based on `FastDateFormat` which is compatible with `SimpleDateFormat`. To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`.

### Why are the changes needed?
To allow users to restore the old behavior in parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring is that `DateTimeFormatter`'s patterns are not fully compatible with `SimpleDateFormat` patterns, see https://issues.apache.org/jira/browse/SPARK-30668

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- Added a new test to `DateFunctionsSuite`
- Restored additional test cases in `JsonInferSchemaSuite`

Closes #27441 from MaxGekk/support-simpledateformat.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 459e757)
Signed-off-by: Wenchen Fan <[email protected]>
Test build #117909 has finished for PR 27441 at commit
  checkTimeZoneParsing(Timestamp.valueOf("2020-01-27 20:06:11.847"))
}
withSQLConf(SQLConf.LEGACY_TIME_PARSER_ENABLED.key -> "false") {
  checkTimeZoneParsing(null)
fallback to the old parser?
Silent fallback to the old parser can lead to mixed values in the same column: some in the combined Julian+Gregorian calendar, others in the Proleptic Gregorian calendar.
Strings parsed in different calendars may differ by up to a dozen days.
… for new DateFormatter

### What changes were proposed in this pull request?
This is a follow-up work for #27441. For the cases where the new TimestampFormatter returns null while the legacy formatter can return a value, we need to throw an exception instead of a silent change. The legacy config will be referenced in the error message.

### Why are the changes needed?
Avoid silent result changes for the new behavior in 3.0.

### Does this PR introduce any user-facing change?
Yes, an exception is thrown when we detect that the legacy formatter can parse the string and the new formatter returns null.

### How was this patch tested?
Extend existing UT.

Closes #27537 from xuanyuanking/SPARK-30668-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?

In the PR, I propose to partially revert the commit 51a6ba0, and provide a legacy parser based on `FastDateFormat` which is compatible with `SimpleDateFormat`. To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`.

Why are the changes needed?

To allow users to restore the old behavior in parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring is that `DateTimeFormatter`'s patterns are not fully compatible with `SimpleDateFormat` patterns, see https://issues.apache.org/jira/browse/SPARK-30668

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

- Added a new test to `DateFunctionsSuite`
- Restored additional test cases in `JsonInferSchemaSuite`
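As a usage sketch (the query is adapted from the SPARK-30668 report; exact results depend on the Spark build), the legacy parser introduced by this PR would be enabled in a SQL session like this:

```sql
-- Hypothetical session: restore SimpleDateFormat-based parsing.
SET spark.sql.legacy.timeParser.enabled=true;

-- With the legacy parser, 'z' accepts RFC 822 offsets such as -0800
-- (SimpleDateFormat's general time zone), so this parses; the new
-- DateTimeFormatter-based parser treats 'z' as a zone name and returns null.
SELECT to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
```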