Conversation

@MaxGekk (Member) commented Feb 3, 2020

What changes were proposed in this pull request?

In this PR, I propose to partially revert commit 51a6ba0 and provide a legacy parser based on FastDateFormat, which is compatible with SimpleDateFormat.

To enable the legacy parser, set spark.sql.legacy.timeParser.enabled to true.
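
For example, a minimal sketch of enabling it at runtime (assuming an active SparkSession bound to `spark`):

```scala
// Switch this session back to the legacy FastDateFormat-based parser.
spark.conf.set("spark.sql.legacy.timeParser.enabled", "true")

// Equivalent SQL form:
spark.sql("SET spark.sql.legacy.timeParser.enabled=true")
```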

Why are the changes needed?

To allow users to restore the old behavior of parsing timestamps/dates with SimpleDateFormat patterns. The main reason for restoring it is that DateTimeFormatter's patterns are not fully compatible with SimpleDateFormat patterns; see https://issues.apache.org/jira/browse/SPARK-30668

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

  • Added a new test to DateFunctionsSuite.
  • Restored additional test cases in JsonInferSchemaSuite.

@MaxGekk (Member, Author) commented Feb 3, 2020

I will update the SQL migration guide soon.

@MaxGekk (Member, Author) commented Feb 3, 2020

jenkins, retest this, please

@cloud-fan (Contributor)

This patch LGTM as it is (pending migration guide updates). Another concern: since we use the new formatter by default, shall we try the old formatter if the new formatter doesn't work? Then users wouldn't need to enable the legacy config in some cases.

@SparkQA commented Feb 3, 2020

Test build #117770 has finished for PR 27441 at commit af38b87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Hi, @gatorsmile. Since you are the issue reporter, could you confirm that you are okay with this solution?

@MaxGekk (Member, Author) commented Feb 3, 2020

> since we use the new formatter by default, shall we try the old formatter if the new formatter doesn't work? Then users wouldn't need to enable the legacy config in some cases.

This may have downsides: some values in a column could be parsed by the old parser, which is based on the hybrid (Julian + Gregorian) calendar, while other values in the same column are parsed by the new one, which uses the Proleptic Gregorian calendar. This could make users' lives harder.
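
To illustrate the calendar mismatch outside of Spark, here is a standalone sketch (the sample date `1000-01-01` and the epoch-day comparison are only illustrative):

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.util.TimeZone

// Legacy path: SimpleDateFormat sits on the hybrid Julian + Gregorian calendar.
val legacyFmt = new SimpleDateFormat("yyyy-MM-dd")
legacyFmt.setTimeZone(TimeZone.getTimeZone("UTC"))
val legacyEpochDay = Math.floorDiv(legacyFmt.parse("1000-01-01").getTime, 86400000L)

// New path: java.time uses the Proleptic Gregorian calendar.
val newEpochDay = LocalDate.parse("1000-01-01").toEpochDay

// For dates before the Gregorian cutover the two epoch days disagree,
// which is why silently mixing both parsers within one column would be confusing.
println(s"legacy = $legacyEpochDay, java.time = $newEpochDay")
```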

@MaxGekk (Member, Author) commented Feb 3, 2020

There are some problems with Jenkins:

ERROR: [WS-CLEANUP] Cannot delete workspace: remote file operation failed: /home/jenkins/workspace/SparkPullRequestBuilder at hudson.remoting.Channel@35d245c0:amp-jenkins-worker-02: java.nio.file.FileSystemException: /home/jenkins/workspace/SparkPullRequestBuilder: Read-only file system
ERROR: Cannot delete workspace: remote file operation failed: /home/jenkins/workspace/SparkPullRequestBuilder at hudson.remoting.Channel@35d245c0:amp-jenkins-worker-02: java.nio.file.FileSystemException: /home/jenkins/workspace/SparkPullRequestBuilder: Read-only file system

@MaxGekk (Member, Author) commented Feb 3, 2020

> I will update the SQL migration guide soon.

> pending migration guide updates

I am considering 2 options for updating the SQL migration guide:

  1. Write a separate item about SimpleDateFormat vs. DateTimeFormatter in parsing timestamps/dates using a pattern.
  2. Modify the existing items: https://github.com/apache/spark/blame/cd5f03a3ba18ae455f93abc5e5d04f098fa8f046/docs/sql-migration-guide.md#L70-L72

The first option will clearly highlight the changes. @cloud-fan @dongjoon-hyun WDYT?

@MaxGekk (Member, Author) commented Feb 3, 2020

TODO: Need to update all docs like

follow the formats at ``java.time.format.DateTimeFormatter``. This

@cloud-fan (Contributor)

A separate item sounds good, but don't forget to remove the existing item.

> TODO: Need to update all docs like

I don't think so. Legacy configs are internal and are not expected to be turned on in most cases. We don't need to mention the legacy config in user-facing documents.

@MaxGekk (Member, Author) commented Feb 4, 2020

> Legacy configs are internal ...

At the moment, the config is not internal. I will make it an internal one. I also reviewed the other legacy configs and made some of them internal: #27448

@MaxGekk (Member, Author) commented Feb 4, 2020

jenkins, retest this, please

@MaxGekk (Member, Author) commented Feb 4, 2020

jenkins, retest this, please

@SparkQA commented Feb 5, 2020

Test build #117861 has finished for PR 27441 at commit 19a47c1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PlanAdaptiveSubqueries(subqueryMap: Map[Long, SubqueryExec]) extends Rule[SparkPlan]

@MaxGekk (Member, Author) commented Feb 5, 2020

jenkins, retest this, please

- CSV/JSON datasources use the `java.time` API for parsing and generating CSV/JSON content. In Spark version 2.4 and earlier, `java.text.SimpleDateFormat` is used for the same purpose, with fallbacks to the parsing mechanisms of Spark 2.0 and 1.x. For example, `2018-12-08 10:39:21.123` with the pattern `yyyy-MM-dd'T'HH:mm:ss.SSS` cannot be parsed since Spark 3.0 because the timestamp does not match the pattern, but it can be parsed by earlier Spark versions due to a fallback to `Timestamp.valueOf`. To parse the same timestamp since Spark 3.0, the pattern should be `yyyy-MM-dd HH:mm:ss.SSS`.

- The `unix_timestamp`, `date_format`, `to_unix_timestamp`, `from_unixtime`, `to_date`, `to_timestamp` functions. The new implementation supports the pattern formats described at https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html and performs strict checking of its input. For example, the `2015-07-22 10:00:00` timestamp cannot be parsed if the pattern is `yyyy-MM-dd` because the parser does not consume the whole input. Another example: the `31/01/2015 00:00` input cannot be parsed with the `dd/MM/yyyy hh:mm` pattern because `hh` means hours in the range `1-12`.
- Parsing/formatting of timestamp/date strings. This affects the CSV/JSON datasources and the `unix_timestamp`, `date_format`, `to_unix_timestamp`, `from_unixtime`, `to_date`, `to_timestamp` functions when user-specified patterns are used for parsing and formatting. Since Spark 3.0, the conversions are based on `java.time.format.DateTimeFormatter`; see https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html. The new implementation performs strict checking of its input. For example, the `2015-07-22 10:00:00` timestamp cannot be parsed if the pattern is `yyyy-MM-dd` because the parser does not consume the whole input. Another example: the `31/01/2015 00:00` input cannot be parsed with the `dd/MM/yyyy hh:mm` pattern because `hh` means hours in the range `1-12`. In Spark version 2.4 and earlier, `java.text.SimpleDateFormat` is used for timestamp/date string conversions, and the supported patterns are described at https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The old behavior can be restored by setting `spark.sql.legacy.timeParser.enabled` to `true`.
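
The strict-parsing examples quoted above can be tried directly; a minimal sketch against a Spark 3.0 session bound to `spark`, with the outcomes described in the guide text shown as comments:

```scala
// New java.time-based formatter (the 3.0 default): leftover input is rejected.
spark.sql("SELECT to_timestamp('2015-07-22 10:00:00', 'yyyy-MM-dd')").show()
// -> null: the parser does not consume the whole input

// 'hh' means clock hours in the range 1-12, so '00' is rejected.
spark.sql("SELECT to_timestamp('31/01/2015 00:00', 'dd/MM/yyyy hh:mm')").show()
// -> null

// Restoring the legacy parser brings back the lenient pre-3.0 behavior.
spark.sql("SET spark.sql.legacy.timeParser.enabled=true")
spark.sql("SELECT to_timestamp('2015-07-22 10:00:00', 'yyyy-MM-dd')").show()
// -> expected: 2015-07-22 00:00:00 (trailing input ignored, as in Spark 2.4)
```
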
Contributor

Is this really related to the Proleptic Gregorian calendar switch? It looks to me like we just switched to a better pattern-string implementation.

Member Author

Yes, it is related, because SimpleDateFormat and DateTimeFormatter use different calendars underneath. The slightly different patterns are just a consequence of the switch.

@cloud-fan (Contributor)

LGTM except one question.

@cloud-fan (Contributor)

thanks, merging to master/3.0!

@cloud-fan closed this in 459e757 on Feb 5, 2020
cloud-fan pushed a commit that referenced this pull request Feb 5, 2020
…estamps/dates strings

### What changes were proposed in this pull request?
In this PR, I propose to partially revert commit 51a6ba0 and provide a legacy parser based on `FastDateFormat`, which is compatible with `SimpleDateFormat`.

To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`.

### Why are the changes needed?
To allow users to restore the old behavior of parsing timestamps/dates with `SimpleDateFormat` patterns. The main reason for restoring it is that `DateTimeFormatter`'s patterns are not fully compatible with `SimpleDateFormat` patterns; see https://issues.apache.org/jira/browse/SPARK-30668

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- Added a new test to `DateFunctionsSuite`.
- Restored additional test cases in `JsonInferSchemaSuite`.

Closes #27441 from MaxGekk/support-simpledateformat.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 459e757)
Signed-off-by: Wenchen Fan <[email protected]>
@SparkQA commented Feb 5, 2020

Test build #117909 has finished for PR 27441 at commit 19a47c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PlanAdaptiveSubqueries(subqueryMap: Map[Long, SubqueryExec]) extends Rule[SparkPlan]

  checkTimeZoneParsing(Timestamp.valueOf("2020-01-27 20:06:11.847"))
}
withSQLConf(SQLConf.LEGACY_TIME_PARSER_ENABLED.key -> "false") {
  checkTimeZoneParsing(null)
Member

fallback to the old parser?

Member Author

A silent fallback to the old parser can lead to mixed values in the same column: some in the hybrid Julian + Gregorian calendar, others in the Proleptic Gregorian calendar.

Member Author

Strings parsed in different calendars may differ by a dozen days.

cloud-fan pushed a commit that referenced this pull request Mar 5, 2020
… for new DateFormatter

### What changes were proposed in this pull request?
This is follow-up work for #27441. For cases where the new TimestampFormatter returns null while the legacy formatter can return a value, we need to throw an exception instead of silently changing the result. The legacy config will be referenced in the error message.

### Why are the changes needed?
Avoid a silent result change caused by the new behavior in 3.0.

### Does this PR introduce any user-facing change?
Yes, an exception is thrown when we detect that the legacy formatter can parse the string but the new formatter returns null.

### How was this patch tested?
Extend existing UT.

Closes #27537 from xuanyuanking/SPARK-30668-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
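
A hypothetical sketch of the "detect and raise" behavior this follow-up describes; the function and parameter names are illustrative rather than Spark's internal API, and only the config name comes from this PR:

```scala
// Illustrative only: try the new parser first; if it fails where the legacy
// parser would succeed, fail loudly and point at the legacy config instead of
// silently changing the result.
def parseOrSuggestLegacy(
    input: String,
    parseNew: String => Option[Long],     // new java.time-based parser (hypothetical)
    parseLegacy: String => Option[Long]   // legacy SimpleDateFormat-compatible parser (hypothetical)
): Option[Long] = {
  parseNew(input).orElse {
    if (parseLegacy(input).isDefined) {
      throw new IllegalArgumentException(
        s"Fail to parse '$input' with the new parser; set " +
          "spark.sql.legacy.timeParser.enabled to true to restore the pre-3.0 behavior.")
    }
    None
  }
}
```
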
cloud-fan pushed a commit that referenced this pull request Mar 5, 2020
… for new DateFormatter

(cherry picked from commit 7db0af5)
Signed-off-by: Wenchen Fan <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
… for new DateFormatter

@MaxGekk deleted the support-simpledateformat branch on June 5, 2020 19:43