[SPARK-31830][SQL] Consistent error handling for datetime formatting and parsing functions#28650
[SPARK-31830][SQL] Consistent error handling for datetime formatting and parsing functions#28650yaooqinn wants to merge 13 commits intoapache:masterfrom yaooqinn:SPARK-31830
Conversation
|
retest this please |
|
Test build #123160 has finished for PR 28650 at commit
|
|
Test build #123165 has finished for PR 28650 at commit
|
|
retest this please |
|
cc @cloud-fan thanks. |
|
retest this please |
|
Test build #123182 has finished for PR 28650 at commit
|
|
Test build #123213 has finished for PR 28650 at commit
|
| } | ||
| // Test escaping of format | ||
| GenerateUnsafeProjection.generate(FromUnixTime(Literal(0L), Literal("\"quote"), UTC_OPT) :: Nil) | ||
| GenerateUnsafeProjection.generate(FromUnixTime(Literal(0L), Literal("\""), UTC_OPT) :: Nil) |
There was a problem hiding this comment.
quote contains invalid format patterns, removing it does not affect the purpose of this test case.
| checkEvaluation( | ||
| FromUnixTime(Literal(1000L), Literal.create(null, StringType), timeZoneId), | ||
| null) | ||
| checkEvaluation( |
There was a problem hiding this comment.
tests for invalid formats are gathered into this case https://github.com/apache/spark/pull/28650/files#diff-3cae35e69eb5cf9538d1d0947c13d8ebR1165.
| UnixTimestamp(Literal(date1), Literal.create(null, StringType), timeZoneId), | ||
| MICROSECONDS.toSeconds( | ||
| DateTimeUtils.daysToMicros(DateTimeUtils.fromJavaDate(date1), tz.toZoneId))) | ||
| checkEvaluation( |
There was a problem hiding this comment.
tests for invalid formats are gathered into this case https://github.com/apache/spark/pull/28650/files#diff-3cae35e69eb5cf9538d1d0947c13d8ebR1165.
| Literal(date1), Literal.create(null, StringType), timeZoneId), | ||
| MICROSECONDS.toSeconds( | ||
| DateTimeUtils.daysToMicros(DateTimeUtils.fromJavaDate(date1), zid))) | ||
| checkEvaluation( |
There was a problem hiding this comment.
tests for invalid formats are gathered into this case https://github.com/apache/spark/pull/28650/files#diff-3cae35e69eb5cf9538d1d0947c13d8ebR1165.
| select from_csv('26/October/2015', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')); | ||
| select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy')); | ||
|
|
||
| select date_format(null, null); |
There was a problem hiding this comment.
These tests were added because datetime function end-2-end tests were missing so far
There was a problem hiding this comment.
we can add from_utc_timestamp and to_utc_timestamp here later too
|
Test build #123217 has finished for PR 28650 at commit
|
|
retest this please |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Show resolved
Hide resolved
| zoneId = zoneId, | ||
| legacyFormat = SIMPLE_DATE_FORMAT, | ||
| needVarLengthSecondFraction = isParsing) | ||
| formatter.validatePatternString() |
There was a problem hiding this comment.
do we need this? It's already done in TimestampFormatter.apply AFAIK
There was a problem hiding this comment.
TimestampFormatter.apply works for the new Parser. Adding this to fail the legacy formatter here will give us a narrower scope of influence.
There was a problem hiding this comment.
can we call validatePatternString in TimestampFormatter.apply for legacy parser as well?
There was a problem hiding this comment.
Let me push a commit to verify with jenkins
There was a problem hiding this comment.
this looks ok through Jenkins
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Outdated
Show resolved
Hide resolved
|
Test build #123230 has finished for PR 28650 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Show resolved
Hide resolved
Why we don't throw |
that logic belongs to |
|
So we didn't fail when we construct the formatter with |
|
No the pattern it will do the same as |
|
But maybe we should apply |
|
Test build #123239 has finished for PR 28650 at commit
|
Yea, can you open a PR for it? We should merge that to 3.0. |
|
OK |
|
Since this PR is for master only, let's fix the |
|
Test build #123674 has finished for PR 28650 at commit
|
| @transient lazy val zoneId: ZoneId = DateTimeUtils.getZoneId(timeZoneId.get) | ||
| } | ||
|
|
||
| trait TimestampFormatterHelper extends TimeZoneAwareExpression { |
There was a problem hiding this comment.
do we have expressions that create DateFormatter?
There was a problem hiding this comment.
no datetime functions, only csv/json ones
| ToUnixTimestamp(Literal("1"), Literal(c.toString)), "3.0") | ||
| checkExceptionInExpression[SparkUpgradeException]( | ||
| UnixTimestamp(Literal("1"), Literal(c.toString)), "3.0") | ||
| def checkException[T <: Exception : ClassTag](c: String, onlyParsing: Boolean = false): Unit = { |
There was a problem hiding this comment.
do we need the onlyParsing flag? we can get it by Seq('E', 'F', 'q', 'Q').contains(c)
| Row(null), Row(null), Row(null), Row(null))) | ||
| val invalid = df1.selectExpr(s"to_unix_timestamp(x, 'yyyy-MM-dd bb:HH:ss')") | ||
| val e = intercept[IllegalArgumentException](invalid.collect()) | ||
| assert(e.getMessage.contains('b')) |
There was a problem hiding this comment.
can we check more in the error message?
There was a problem hiding this comment.
Just pick the intersection of Illegal pattern character 'b' and Unknown pattern letter: b
| select to_timestamp("2019 10:10:10", "yyyy hh:mm:ss"); | ||
|
|
||
| -- Unsupported narrow text style | ||
| select date_format(date '2020-05-23', 'GGGGG'); |
There was a problem hiding this comment.
why we remove these tests?
There was a problem hiding this comment.
datetime-formatting-invalid.sql should be enough to cover these cases
|
Test build #123680 has finished for PR 28650 at commit
|
| case class FromUnixTime(sec: Expression, format: Expression, timeZoneId: Option[String] = None) | ||
| extends BinaryExpression with TimeZoneAwareExpression with ImplicitCastInputTypes { | ||
| extends BinaryExpression with TimestampFormatterHelper with ImplicitCastInputTypes | ||
| with NullIntolerant { |
There was a problem hiding this comment.
shall we add the NullIntolerant in the base class TimestampFormatterHelper? and it seems better to do it in a new PR.
|
Test build #123687 has finished for PR 28650 at commit
|
|
Test build #123689 has finished for PR 28650 at commit
|
|
thanks, merging to master! |
What changes were proposed in this pull request?
Currently,
date_formatandfrom_unixtime,unix_timestamp,to_unix_timestamp,to_timestamp,to_datehave different exception handling behavior for formatting datetime values.In this PR, we apply the exception handling behavior of
date_formattofrom_unixtime,unix_timestamp,to_unix_timestamp,to_timestampandto_date.In the phase of creating the datetime formatted or formating, exceptions will be raised.
e.g.
In the phase of parsing,
DateTimeParseException | DateTimeException | ParseExceptionwill be suppressed, butSparkUpgradeExceptionwill still be raisede.g.
Why are the changes needed?
Consistency
Does this PR introduce any user-facing change?
Yes, invalid datetime patterns will fail
from_unixtime,unix_timestamp,to_unix_timestamp,to_timestampandto_dateinstead of resultingNULLHow was this patch tested?
add more tests