-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31771][SQL] Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q' #28592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @cloud-fan thanks |
|
Test build #122884 has finished for PR 28592 at commit
|
|
Test build #122898 has finished for PR 28592 at commit
|
|
Test build #122926 has finished for PR 28592 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
Outdated
Show resolved
Hide resolved
| @@ -0,0 +1,2 @@ | |||
| --SET spark.sql.legacy.timeParserPolicy=CORRECTED | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have different test results with CORRECTED mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just come to understand what it means, I will rm this case.
|
|
||
| @transient | ||
| private lazy val formatter = getOrCreateFormatter(pattern, locale) | ||
| private lazy val formatter = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we remove this lazy to let it fail fast in the parse phase? @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this one and the others are transient, so the lazy keyword is required.
|
Test build #122928 has finished for PR 28592 at commit
|
| private lazy val formatter: DateTimeFormatter = { | ||
| try { | ||
| getOrCreateFormatter(pattern, locale) | ||
| } catch checkLegacyFormatter(pattern, legacyFormatter.format(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
legacyFormatter.format(0) is hacky... let's add the initialize API
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala
Outdated
Show resolved
Hide resolved
|
Test build #122935 has finished for PR 28592 at commit
|
|
Test build #122937 has finished for PR 28592 at commit
|
|
Test build #122938 has finished for PR 28592 at commit
|
| def format(date: Date): String | ||
| def format(localDate: LocalDate): String | ||
|
|
||
| def initialize(): Unit = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we force all the children to implement initialize?
And maybe a better name is validatePatternString. initialize sounds like it must be called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and we should call it in TimestampFormatter.apply if the policy is not legacy, to fail earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. SGTM.
| * IllegalArgumentException will be thrown. | ||
| * | ||
| * @param pattern the date time pattern | ||
| * @param block a func to capture exception, identically which forces a legacy datetime formatter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
block is a bad name. How about tryLegacyFormatter?
|
Test build #122958 has finished for PR 28592 at commit
|
|
Test build #122963 has finished for PR 28592 at commit
|
|
Test build #122964 has finished for PR 28592 at commit
|
|
|
||
|
|
||
| -- !query | ||
| select from_unixtime(54321, 'QQQQQ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to diff exception handling for IllegalArgumentException at the call sides, the results are not same https://github.com/apache/spark/pull/28592/files#diff-79dd276be45ede6f34e24ad7005b0a7cR801-R806
cc @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's OK. it's already the case in 2.4
|
Test build #122986 has finished for PR 28592 at commit
|
docs/sql-ref-datetime-pattern.md
Outdated
| |**E**|day-of-week|text|Tue; Tuesday| | ||
| |**u**|localized day-of-week|number/text|2; 02; Tue; Tuesday| | ||
| |**F**|week-of-month|number(1)|3| | ||
| |**a**|am-pm-of-day|am/pm|PM| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: am-pm, as it's weird to have / in a name.
docs/sql-ref-datetime-pattern.md
Outdated
| - Text: The text style is determined based on the number of pattern letters used. Less than 4 pattern letters will use the short form. Exactly 4 pattern letters will use the full form. Exactly 5 pattern letters will use the narrow form. 5 or more letters will fail. | ||
|
|
||
| - Number: If the count of letters is one, then the value is output using the minimum number of digits and without padding. Otherwise, the count of digits is used as the width of the output field, with the value zero-padded as necessary. The following pattern letters have constraints on the count of letters. Only one letter 'F' can be specified. Up to two letters of 'd', 'H', 'h', 'K', 'k', 'm', and 's' can be specified. Up to three letters of 'D' can be specified. | ||
| - Number(n): the n here represents the maximum count of letters this type of datetime pattern can be used. If the count of letters is one, then the value is output using the minimum number of digits and without padding. Otherwise, the count of digits is used as the width of the output field, with the value zero-padded as necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the -> The
docs/sql-ref-datetime-pattern.md
Outdated
| J | ||
| ``` | ||
|
|
||
| - AM/PM(a): This outputs the am-pm-of-day. Pattern letter count must be 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AM/PM(a) -> am-pm
| case _: Throwable => throw e | ||
| } | ||
| throw new SparkUpgradeException("3.0", s"Fail to recognize '$pattern' pattern in the" + | ||
| s" new parser. 1) You can set ${SQLConf.LEGACY_TIME_PARSER_POLICY.key} to LEGACY to" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new parser -> DateTimeFormatter
|
Test build #122993 has finished for PR 28592 at commit
|
|
Test build #122994 has finished for PR 28592 at commit
|
|
Test build #122990 has finished for PR 28592 at commit
|
|
Test build #122999 has finished for PR 28592 at commit
|
|
Test build #123010 has finished for PR 28592 at commit
|
|
thanks, merging to master! |
|
Hi @yaooqinn can you send a new PR for 3.0? |
|
OK,thanks for merging |
| private val noRows = None | ||
|
|
||
| private val timestampFormatter = TimestampFormatter( | ||
| private lazy val timestampFormatter = TimestampFormatter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason to make it lazy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the formatter creation will validate the pattern string now, but json/csv has a fallback and shouldn't fail because of invalid pattern string.
What changes were proposed in this pull request?
Five continuous pattern characters with 'G/M/L/E/u/Q/q' means Narrow-Text Style while we turn to use
java.time.DateTimeFormatterBuildersince 3.0.0, which output the leading single letter of the value, e.g.Decemberwould beD. In Spark 2.4 they mean Full-Text Style.In this PR, we explicitly disable Narrow-Text Style for these pattern characters.
Why are the changes needed?
Without this change, there will be a silent data change.
Does this PR introduce any user-facing change?
Yes, queries with datetime operations using datetime patterns, e.g.
G/M/L/E/uwill fail if the pattern length is 5 and other patterns, e,g. 'k', 'm' also can accept a certain number of letters.2, datetime patterns that are not supported by both the new parser and the legacy, e.g. "QQQQQ", "qqqqq", will get IllegalArgumentException which is captured by Spark internally and results NULL to end-users.
How was this patch tested?
add unit tests