Conversation

@yaooqinn (Member) commented May 20, 2020

What changes were proposed in this pull request?

Five consecutive pattern letters of 'G/M/L/E/u/Q/q' mean Narrow-Text Style since we switched to java.time.DateTimeFormatterBuilder in 3.0.0, which outputs only the leading single letter of the value, e.g. December becomes D. In Spark 2.4 they mean Full-Text Style.
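
For illustration only (not part of the patch), this is how java.time itself renders four versus five 'M' letters, which is why the narrow style has to be disabled explicitly on the Spark side:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val date = LocalDate.of(2020, 12, 25)
// Full-Text Style: exactly four pattern letters
DateTimeFormatter.ofPattern("MMMM", Locale.US).format(date)   // "December"
// Narrow-Text Style: five pattern letters keep only the leading letter
DateTimeFormatter.ofPattern("MMMMM", Locale.US).format(date)  // "D"
```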

In this PR, we explicitly disable Narrow-Text Style for these pattern characters.

Why are the changes needed?

Without this change, there would be a silent data change for workloads migrated from Spark 2.4, because the same pattern now produces different output.

Does this PR introduce any user-facing change?

Yes. Queries with datetime operations using datetime patterns, e.g. G/M/L/E/u, will fail if the pattern length is 5; other patterns, e.g. 'k', 'm', also accept only a certain number of letters.

  1. Datetime patterns that are not supported by the new parser but are supported by the legacy one will raise SparkUpgradeException, e.g. "GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa". Two options are given to end users: one is to use legacy mode, and the other is to follow the new online doc for correct datetime patterns.

  2. Datetime patterns that are supported by neither the new parser nor the legacy one, e.g. "QQQQQ", "qqqqq", will raise IllegalArgumentException, which is caught by Spark internally and results in NULL for end users (see the sketch after this list).
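
As a rough illustration of the two cases (a hypothetical spark-shell session; whether the result is NULL or an error depends on how the call site handles the exception, as discussed later in this thread):

```scala
// Assumes a SparkSession named `spark`; exact messages depend on the build.

// Case 1: pattern rejected by the new parser but accepted by the legacy one
spark.sql("SELECT date_format(current_timestamp, 'MMMMM')").show()
// => SparkUpgradeException pointing at spark.sql.legacy.timeParserPolicy=LEGACY

// Case 2: pattern accepted by neither parser
spark.sql("SELECT from_unixtime(54321, 'QQQQQ')").show()
// => NULL at this call site, since the IllegalArgumentException is caught internally
```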

How was this patch tested?

Added unit tests.

@yaooqinn (Member Author)

cc @cloud-fan thanks

SparkQA commented May 20, 2020

Test build #122884 has finished for PR 28592 at commit 78fb74a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 20, 2020

Test build #122898 has finished for PR 28592 at commit e178a6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn yaooqinn requested a review from cloud-fan May 21, 2020 02:45

SparkQA commented May 21, 2020

Test build #122926 has finished for PR 28592 at commit 1d31ed2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -0,0 +1,2 @@
--SET spark.sql.legacy.timeParserPolicy=CORRECTED
Contributor

do we have different test results with CORRECTED mode?

Member Author

I've just come to understand what it means; I'll remove this case.


@transient
private lazy val formatter = getOrCreateFormatter(pattern, locale)
private lazy val formatter = {
Member Author

Shall we remove this lazy to let it fail fast in the parse phase? @cloud-fan

Contributor

yea

@yaooqinn (Member Author) May 21, 2020

Hmm, this one and the others are transient, so the lazy keyword is required.
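
A minimal sketch (not Spark code) of why the field has to stay lazy when it is transient: a @transient field is dropped during Java serialization, so a lazy val lets it be rebuilt after deserialization on the executor instead of staying null.

```scala
class PatternHolder(pattern: String) extends Serializable {
  // Dropped on serialization; re-created on first access after deserialization.
  @transient private lazy val formatter =
    java.time.format.DateTimeFormatter.ofPattern(pattern)

  def render(d: java.time.LocalDate): String = formatter.format(d)
}
```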

SparkQA commented May 21, 2020

Test build #122928 has finished for PR 28592 at commit 1144c03.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private lazy val formatter: DateTimeFormatter = {
try {
getOrCreateFormatter(pattern, locale)
} catch checkLegacyFormatter(pattern, legacyFormatter.format(0))
Contributor

legacyFormatter.format(0) is hacky... let's add the initialize API

SparkQA commented May 21, 2020

Test build #122935 has finished for PR 28592 at commit 549a122.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 21, 2020

Test build #122937 has finished for PR 28592 at commit c877ac5.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

SparkQA commented May 21, 2020

Test build #122938 has finished for PR 28592 at commit b2abeeb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def format(date: Date): String
def format(localDate: LocalDate): String

def initialize(): Unit = {}
Contributor

shall we force all the children to implement initialize?

And maybe a better name is validatePatternString. initialize sounds like it must be called.

Contributor

and we should call it in TimestampFormatter.apply if the policy is not legacy, to fail earlier.

Member Author

Got it. SGTM.
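
A rough sketch of the API shape being discussed, using the reviewer's suggested name; this is simplified and not the final Spark code. Touching the transient lazy field inside validatePatternString, and calling it from apply, makes an invalid pattern fail at formatter creation rather than when the first row is formatted:

```scala
trait SimpleFormatter extends Serializable {
  def format(epochSeconds: Long): String
  // Subclasses validate their pattern eagerly instead of on first use.
  def validatePatternString(): Unit
}

class Iso8601LikeFormatter(pattern: String) extends SimpleFormatter {
  @transient private lazy val underlying =
    java.time.format.DateTimeFormatter
      .ofPattern(pattern)
      .withZone(java.time.ZoneOffset.UTC)

  override def format(epochSeconds: Long): String =
    underlying.format(java.time.Instant.ofEpochSecond(epochSeconds))

  // Forcing the lazy val makes an invalid pattern throw here, not later.
  override def validatePatternString(): Unit = underlying
}

object SimpleFormatter {
  def apply(pattern: String): SimpleFormatter = {
    val formatter = new Iso8601LikeFormatter(pattern)
    formatter.validatePatternString() // fail fast, as suggested in the review
    formatter
  }
}
```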

* IllegalArgumentException will be thrown.
*
* @param pattern the date time pattern
* @param block a func to capture exception, identically which forces a legacy datetime formatter
Contributor

block is a bad name. How about tryLegacyFormatter?
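
Not the actual patch, just a sketch of the helper's shape with the reviewer's suggested parameter name: a by-name block that forces a legacy formatter with the same pattern. If the legacy path also fails, the original error is rethrown; otherwise the error is upgraded. The real code throws SparkUpgradeException with an upgrade hint (see the diff further down this thread); a stand-in exception keeps the snippet self-contained.

```scala
def checkInvalidPattern(pattern: String)(
    tryLegacyFormatter: => Unit): PartialFunction[Throwable, Nothing] = {
  case e: IllegalArgumentException =>
    try {
      tryLegacyFormatter // force the legacy formatter with the same pattern
    } catch {
      case _: Throwable => throw e // legacy cannot handle it either: keep the original error
    }
    throw new IllegalStateException(
      s"Fail to recognize '$pattern' pattern in the new formatter", e)
}
```

It would be attached as the catch handler of the formatter creation, as in the `catch checkLegacyFormatter(...)` snippet quoted earlier in this thread.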

SparkQA commented May 22, 2020

Test build #122958 has finished for PR 28592 at commit 1503042.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 22, 2020

Test build #122963 has finished for PR 28592 at commit 052bfad.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 22, 2020

Test build #122964 has finished for PR 28592 at commit 8141ef9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.



-- !query
select from_unixtime(54321, 'QQQQQ')
@yaooqinn (Member Author) May 22, 2020

Due to different exception handling for IllegalArgumentException at the call sites, the results are not the same: https://github.com/apache/spark/pull/28592/files#diff-79dd276be45ede6f34e24ad7005b0a7cR801-R806
cc @cloud-fan

Contributor

it's OK. it's already the case in 2.4

SparkQA commented May 22, 2020

Test build #122986 has finished for PR 28592 at commit 09b407f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SparkListenerResourceProfileAdded(resourceProfile: ResourceProfile)

|**E**|day-of-week|text|Tue; Tuesday|
|**u**|localized day-of-week|number/text|2; 02; Tue; Tuesday|
|**F**|week-of-month|number(1)|3|
|**a**|am-pm-of-day|am/pm|PM|
Contributor

nit: am-pm, as it's weird to have / in a name.

- Text: The text style is determined based on the number of pattern letters used. Less than 4 pattern letters will use the short form. Exactly 4 pattern letters will use the full form. Exactly 5 pattern letters will use the narrow form. 5 or more letters will fail.

- Number: If the count of letters is one, then the value is output using the minimum number of digits and without padding. Otherwise, the count of digits is used as the width of the output field, with the value zero-padded as necessary. The following pattern letters have constraints on the count of letters. Only one letter 'F' can be specified. Up to two letters of 'd', 'H', 'h', 'K', 'k', 'm', and 's' can be specified. Up to three letters of 'D' can be specified.
- Number(n): the n here represents the maximum count of letters this type of datetime pattern can be used. If the count of letters is one, then the value is output using the minimum number of digits and without padding. Otherwise, the count of digits is used as the width of the output field, with the value zero-padded as necessary.
Contributor

the -> The
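
As a side illustration of the Number vs Number(n) wording in the quoted doc text (not part of the diff): one letter uses the minimum number of digits, more letters zero-pad to that width.

```scala
import java.time.LocalTime
import java.time.format.DateTimeFormatter

val t = LocalTime.of(9, 5)
DateTimeFormatter.ofPattern("m").format(t)   // "5"
DateTimeFormatter.ofPattern("mm").format(t)  // "05"
```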

J
```

- AM/PM(a): This outputs the am-pm-of-day. Pattern letter count must be 1.
Contributor

AM/PM(a) -> am-pm

case _: Throwable => throw e
}
throw new SparkUpgradeException("3.0", s"Fail to recognize '$pattern' pattern in the" +
s" new parser. 1) You can set ${SQLConf.LEGACY_TIME_PARSER_POLICY.key} to LEGACY to" +
Contributor

new parser -> DateTimeFormatter

SparkQA commented May 22, 2020

Test build #122993 has finished for PR 28592 at commit 0a76ba3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 22, 2020

Test build #122994 has finished for PR 28592 at commit 5360d88.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 22, 2020

Test build #122990 has finished for PR 28592 at commit 75fdbcb.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 22, 2020

Test build #122999 has finished for PR 28592 at commit ee1d62a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 22, 2020

Test build #123010 has finished for PR 28592 at commit 3047f88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SecondsToTimestamp(child: Expression)
  • case class MillisToTimestamp(child: Expression)
  • case class MicrosToTimestamp(child: Expression)

@cloud-fan (Contributor) commented May 25, 2020

thanks, merging to master!

@cloud-fan cloud-fan closed this in 695cb61 May 25, 2020
@cloud-fan (Contributor)

Hi @yaooqinn can you send a new PR for 3.0?

@yaooqinn (Member Author)

OK, thanks for merging.

private val noRows = None

private val timestampFormatter = TimestampFormatter(
private lazy val timestampFormatter = TimestampFormatter(
Member

What is the reason to make it lazy?

Contributor

The formatter creation will validate the pattern string now, but JSON/CSV has a fallback and shouldn't fail because of an invalid pattern string.
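
A minimal sketch of that point (not the actual JSON/CSV code): keeping the formatter lazy defers pattern validation, so a fallback parse can still be attempted even when the configured pattern is invalid.

```scala
class OptionalPatternParser(pattern: String) {
  // Lazy: an invalid pattern only throws when the formatter is first touched.
  private lazy val timestampFormatter =
    java.time.format.DateTimeFormatter.ofPattern(pattern)

  def parse(s: String): Option[java.time.LocalDateTime] =
    try {
      Some(java.time.LocalDateTime.parse(s, timestampFormatter))
    } catch {
      case _: Exception =>
        // Fall back, e.g. to the default ISO-8601 parse, instead of failing the query.
        scala.util.Try(java.time.LocalDateTime.parse(s)).toOption
    }
}
```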
