49 changes: 21 additions & 28 deletions docs/sql-ref-datetime-pattern.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,25 +30,25 @@ Spark uses pattern letters in the following table for date and timestamp parsing

|Symbol|Meaning|Presentation|Examples|
|------|-------|------------|--------|
|**G**|era|text|AD; Anno Domini; A|
|**G**|era|text|AD; Anno Domini|
|**y**|year|year|2020; 20|
|**D**|day-of-year|number|189|
|**M/L**|month-of-year|number/text|7; 07; Jul; July; J|
|**d**|day-of-month|number|28|
|**D**|day-of-year|number(3)|189|
|**M/L**|month-of-year|month|7; 07; Jul; July|
|**d**|day-of-month|number(2)|28|
|**Q/q**|quarter-of-year|number/text|3; 03; Q3; 3rd quarter|
|**Y**|week-based-year|year|1996; 96|
|**w**|week-of-week-based-year|number|27|
|**W**|week-of-month|number|4|
|**E**|day-of-week|text|Tue; Tuesday; T|
|**u**|localized day-of-week|number/text|2; 02; Tue; Tuesday; T|
|**F**|week-of-month|number|3|
|**a**|am-pm-of-day|text|PM|
|**h**|clock-hour-of-am-pm (1-12)|number|12|
|**K**|hour-of-am-pm (0-11)|number|0|
|**k**|clock-hour-of-day (1-24)|number|0|
|**H**|hour-of-day (0-23)|number|0|
|**m**|minute-of-hour|number|30|
|**s**|second-of-minute|number|55|
|**w**|week-of-week-based-year|number(2)|27|
|**W**|week-of-month|number(1)|4|
|**E**|day-of-week|text|Tue; Tuesday|
|**u**|localized day-of-week|number/text|2; 02; Tue; Tuesday|
|**F**|week-of-month|number(1)|3|
|**a**|am-pm-of-day|am-pm|PM|
|**h**|clock-hour-of-am-pm (1-12)|number(2)|12|
|**K**|hour-of-am-pm (0-11)|number(2)|0|
|**k**|clock-hour-of-day (1-24)|number(2)|0|
|**H**|hour-of-day (0-23)|number(2)|0|
|**m**|minute-of-hour|number(2)|30|
|**s**|second-of-minute|number(2)|55|
|**S**|fraction-of-second|fraction|978|
|**V**|time-zone ID|zone-id|America/Los_Angeles; Z; -08:30|
|**z**|time-zone name|zone-name|Pacific Standard Time; PST|
Expand All @@ -63,9 +63,9 @@ Spark uses pattern letters in the following table for date and timestamp parsing

The count of pattern letters determines the format.

- Text: The text style is determined based on the number of pattern letters used. Less than 4 pattern letters will use the short form. Exactly 4 pattern letters will use the full form. Exactly 5 pattern letters will use the narrow form. Six or more letters will fail.
- Text: The text style is determined by the number of pattern letters used. Fewer than 4 pattern letters use the short form. Exactly 4 pattern letters use the full form. 5 or more letters fail.

- Number: If the count of letters is one, then the value is output using the minimum number of digits and without padding. Otherwise, the count of digits is used as the width of the output field, with the value zero-padded as necessary. The following pattern letters have constraints on the count of letters. Only one letter 'F' can be specified. Up to two letters of 'd', 'H', 'h', 'K', 'k', 'm', and 's' can be specified. Up to three letters of 'D' can be specified.
- Number(n): Here n is the maximum number of letters allowed for this type of datetime pattern. If the count of letters is one, then the value is output using the minimum number of digits and without padding. Otherwise, the count of digits is used as the width of the output field, with the value zero-padded as necessary.

- Number/Text: If the count of pattern letters is 3 or greater, use the Text rules above. Otherwise use the Number rules above.
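The letter-count rules above can be tried directly against `java.time.format.DateTimeFormatter`, which the diff imports for the non-legacy path. The following is a minimal sketch, not Spark code: the object name `LetterCountDemo`, the sample timestamp, and the choice of `Locale.US` are assumptions made for illustration.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.Try

// Hypothetical demo object; it uses only JDK classes, not Spark's formatters.
object LetterCountDemo {
  // Fixed sample value: 1970-07-01 09:05:03.
  private val sample = LocalDateTime.of(1970, 7, 1, 9, 5, 3)

  // Build a formatter for the pattern and format the sample in the US locale.
  def fmt(pattern: String): String =
    DateTimeFormatter.ofPattern(pattern, Locale.US).format(sample)

  def main(args: Array[String]): Unit = {
    // Number/Text: one or two letters are numeric, three or more are text.
    Seq("M", "MM", "MMM", "MMMM").foreach(p => println(s"$p -> ${fmt(p)}"))
    // Number(2): one letter is unpadded, two letters are zero-padded.
    Seq("m", "mm").foreach(p => println(s"$p -> ${fmt(p)}"))
    // A third 'm' exceeds the maximum, so building the formatter fails.
    println(s"mmm rejected: ${Try(fmt("mmm")).isFailure}")
  }
}
```

Note that the failure for `mmm` happens when the formatter is built, before any value is formatted.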

Expand All @@ -76,7 +76,7 @@ The count of pattern letters determines the format.

- Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded when 'G' is not present.

- Month: If the number of pattern letters is 3 or more, the month is interpreted as text; otherwise, it is interpreted as a number. The text form is depend on letters - 'M' denotes the 'standard' form, and 'L' is for 'stand-alone' form. The difference between the 'standard' and 'stand-alone' forms is trickier to describe as there is no difference in English. However, in other languages there is a difference in the word used when the text is used alone, as opposed to in a complete date. For example, the word used for a month when used alone in a date picker is different to the word used for month in association with a day and year in a date. In Russian, 'Июль' is the stand-alone form of July, and 'Июля' is the standard form. Here are examples for all supported pattern letters (more than 5 letters is invalid):
- Month: If the number of pattern letters is 3 or more, the month is interpreted as text; otherwise, it is interpreted as a number. The text form depends on the letter used: 'M' denotes the 'standard' form, and 'L' denotes the 'stand-alone' form. The difference between the 'standard' and 'stand-alone' forms is trickier to describe as there is no difference in English. However, in other languages there is a difference in the word used when the text is used alone, as opposed to in a complete date. For example, the word used for a month when used alone in a date picker is different from the word used for a month in association with a day and year in a date. In Russian, 'Июль' is the stand-alone form of July, and 'Июля' is the standard form. Here are examples for all supported pattern letters (more than 4 letters are invalid):
- `'M'` or `'L'`: Month number in a year starting from 1. There is no difference between 'M' and 'L'. Month from 1 to 9 are printed without padding.
```sql
spark-sql> select date_format(date '1970-01-01', "M");
Expand Down Expand Up @@ -119,13 +119,8 @@ The count of pattern letters determines the format.
spark-sql> select to_csv(named_struct('date', date '1970-01-01'), map('dateFormat', 'LLLL', 'locale', 'RU'));
январь
```
- `'LLLLL'` or `'MMMMM'`: Narrow textual representation of standard or stand-alone forms. Typically it is a single letter.
```sql
spark-sql> select date_format(date '1970-07-01', "LLLLL");
J
spark-sql> select date_format(date '1970-01-01', "MMMMM");
J
```

- am-pm: This outputs the am-pm-of-day. Pattern letter count must be 1.

- Zone ID(V): This outputs the time-zone ID. Pattern letter count must be 2.

Expand All @@ -147,5 +142,3 @@ More details for the text style:
- Short Form: Short text, typically an abbreviation. For example, day-of-week Monday might output "Mon".

- Full Form: Full text, typically the full description. For example, day-of-week Monday might output "Monday".

- Narrow Form: Narrow text, typically a single letter. For example, day-of-week Monday might output "M".
Original file line number Diff line number Diff line change
Expand Up @@ -85,13 +85,13 @@ class UnivocityParser(
// We preallocate it to avoid unnecessary allocations.
private val noRows = None

private val timestampFormatter = TimestampFormatter(
private lazy val timestampFormatter = TimestampFormatter(
Member: What is the reason to make it lazy?

Contributor: Formatter creation now validates the pattern string, but JSON/CSV parsing has a fallback and shouldn't fail because of an invalid pattern string.

options.timestampFormat,
options.zoneId,
options.locale,
legacyFormat = FAST_DATE_FORMAT,
needVarLengthSecondFraction = true)
private val dateFormatter = DateFormatter(
private lazy val dateFormatter = DateFormatter(
options.dateFormat,
options.zoneId,
options.locale,
Expand Down
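The review exchange in this file concerns when an invalid pattern surfaces. A rough sketch of the difference between `val` and `lazy val`, assuming two hypothetical classes (`EagerParser`, `LazyParser`) that stand in for the real parsers and use the JDK `DateTimeFormatter` rather than Spark's `TimestampFormatter`:

```scala
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical stand-in: the formatter is built as part of construction.
class EagerParser(pattern: String) {
  val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern(pattern)
}

// Hypothetical stand-in: the formatter is built on first access only.
class LazyParser(pattern: String) {
  lazy val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern(pattern)
}

object LazyFormatterDemo {
  // "mmm" has too many letters for minute-of-hour, so ofPattern throws.
  private val badPattern = "mmm"

  // With a plain val, constructing the parser already fails.
  def eagerConstructionFails: Boolean = Try(new EagerParser(badPattern)).isFailure

  // With a lazy val, construction succeeds and a fallback path stays usable.
  def lazyConstructionSucceeds: Boolean = Try(new LazyParser(badPattern)).isSuccess

  // The error only appears once the formatter is actually needed.
  def lazyFailsOnFirstUse: Boolean = Try(new LazyParser(badPattern).formatter).isFailure
}
```

Under these assumptions, a parser that never touches the formatter never sees the validation error, which matches the fallback behaviour the contributor describes.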
Original file line number Diff line number Diff line change
Expand Up @@ -880,6 +880,7 @@ abstract class ToTimestamp
legacyFormat = SIMPLE_DATE_FORMAT,
needVarLengthSecondFraction = true)
} catch {
case e: SparkUpgradeException => throw e
case NonFatal(_) => null
}

Expand Down Expand Up @@ -1061,6 +1062,7 @@ case class FromUnixTime(sec: Expression, format: Expression, timeZoneId: Option[
legacyFormat = SIMPLE_DATE_FORMAT,
needVarLengthSecondFraction = false)
} catch {
case e: SparkUpgradeException => throw e
case NonFatal(_) => null
}

Expand All @@ -1076,6 +1078,7 @@ case class FromUnixTime(sec: Expression, format: Expression, timeZoneId: Option[
try {
UTF8String.fromString(formatter.format(time.asInstanceOf[Long] * MICROS_PER_SECOND))
} catch {
case e: SparkUpgradeException => throw e
case NonFatal(_) => null
}
}
Expand All @@ -1093,6 +1096,7 @@ case class FromUnixTime(sec: Expression, format: Expression, timeZoneId: Option[
needVarLengthSecondFraction = false)
.format(time.asInstanceOf[Long] * MICROS_PER_SECOND))
} catch {
case e: SparkUpgradeException => throw e
case NonFatal(_) => null
}
}
Expand Down
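The added `case e: SparkUpgradeException => throw e` lines rely on `catch` cases being tried in order: the specific case must come before `NonFatal(_)`, which would otherwise turn the exception into `null`. A small sketch of that ordering, where `UpgradeSignal` and `formatOrNull` are hypothetical names and `UpgradeSignal` merely stands in for `SparkUpgradeException`:

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in for SparkUpgradeException.
final class UpgradeSignal(msg: String) extends RuntimeException(msg)

object CatchOrderDemo {
  // Evaluate body; rethrow the upgrade signal, map other non-fatal errors to null.
  def formatOrNull(body: => String): String =
    try body
    catch {
      // Must be listed first: NonFatal(_) would also match this exception.
      case e: UpgradeSignal => throw e
      case NonFatal(_) => null
    }
}
```

For example, `CatchOrderDemo.formatOrNull(throw new IllegalStateException("bad input"))` yields `null`, while throwing an `UpgradeSignal` inside the block propagates to the caller.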
Original file line number Diff line number Diff line change
Expand Up @@ -56,13 +56,13 @@ class JacksonParser(

private val factory = options.buildJsonFactory()

private val timestampFormatter = TimestampFormatter(
private lazy val timestampFormatter = TimestampFormatter(
options.timestampFormat,
options.zoneId,
options.locale,
legacyFormat = FAST_DATE_FORMAT,
needVarLengthSecondFraction = true)
private val dateFormatter = DateFormatter(
private lazy val dateFormatter = DateFormatter(
options.dateFormat,
options.zoneId,
options.locale,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ package org.apache.spark.sql.catalyst.util

import java.text.SimpleDateFormat
import java.time.{LocalDate, ZoneId}
import java.time.format.DateTimeFormatter
import java.util.{Date, Locale}

import org.apache.commons.lang3.time.FastDateFormat
Expand All @@ -33,6 +34,8 @@ sealed trait DateFormatter extends Serializable {
def format(days: Int): String
def format(date: Date): String
def format(localDate: LocalDate): String

def validatePatternString(): Unit
}

class Iso8601DateFormatter(
Expand Down Expand Up @@ -70,6 +73,12 @@ class Iso8601DateFormatter(
override def format(date: Date): String = {
legacyFormatter.format(date)
}

override def validatePatternString(): Unit = {
try {
formatter
} catch checkLegacyFormatter(pattern, legacyFormatter.validatePatternString)
}
}

trait LegacyDateFormatter extends DateFormatter {
Expand All @@ -93,13 +102,16 @@ class LegacyFastDateFormatter(pattern: String, locale: Locale) extends LegacyDat
private lazy val fdf = FastDateFormat.getInstance(pattern, locale)
override def parseToDate(s: String): Date = fdf.parse(s)
override def format(d: Date): String = fdf.format(d)
override def validatePatternString(): Unit = fdf
}

class LegacySimpleDateFormatter(pattern: String, locale: Locale) extends LegacyDateFormatter {
@transient
private lazy val sdf = new SimpleDateFormat(pattern, locale)
override def parseToDate(s: String): Date = sdf.parse(s)
override def format(d: Date): String = sdf.format(d)
override def validatePatternString(): Unit = sdf

}

object DateFormatter {
Expand All @@ -118,7 +130,9 @@ object DateFormatter {
if (SQLConf.get.legacyTimeParserPolicy == LEGACY) {
getLegacyFormatter(pattern, zoneId, locale, legacyFormat)
} else {
new Iso8601DateFormatter(pattern, zoneId, locale, legacyFormat)
val df = new Iso8601DateFormatter(pattern, zoneId, locale, legacyFormat)
df.validatePatternString()
df
}
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,34 @@ trait DateTimeFormatterHelper {
s"set ${SQLConf.LEGACY_TIME_PARSER_POLICY.key} to LEGACY to restore the behavior " +
s"before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.", e)
}

/**
* When the new DateTimeFormatter failed to initialize because of invalid datetime pattern, it
* will throw IllegalArgumentException. If the pattern can be recognized by the legacy formatter
* it will raise SparkUpgradeException to tell users to restore the previous behavior via LEGACY
* policy or follow our guide to correct their pattern. Otherwise, the original
* IllegalArgumentException will be thrown.
*
* @param pattern the date time pattern
* @param tryLegacyFormatter a by-name block that forces a legacy datetime formatter to be
* initialized, so that any exception it throws can be captured
*/

protected def checkLegacyFormatter(
pattern: String,
tryLegacyFormatter: => Unit): PartialFunction[Throwable, DateTimeFormatter] = {
case e: IllegalArgumentException =>
try {
tryLegacyFormatter
} catch {
case _: Throwable => throw e
}
throw new SparkUpgradeException("3.0", s"Fail to recognize '$pattern' pattern in the" +
s" DateTimeFormatter. 1) You can set ${SQLConf.LEGACY_TIME_PARSER_POLICY.key} to LEGACY" +
s" to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern" +
s" with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html",
e)
}
}
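// The decision made by checkLegacyFormatter can be sketched outside Spark as follows.
// This is an illustration under stated assumptions, not the PR's implementation:
// UpgradeHint is a hypothetical stand-in for SparkUpgradeException, and the JDK's
// DateTimeFormatter and SimpleDateFormat stand in for the new and legacy formatters.

```scala
import java.text.SimpleDateFormat
import java.time.format.DateTimeFormatter
import java.util.Locale

// Hypothetical stand-in for SparkUpgradeException.
final class UpgradeHint(msg: String, cause: Throwable) extends RuntimeException(msg, cause)

object PatternCheck {
  // Succeeds silently when the new formatter accepts the pattern.
  def validate(pattern: String): Unit =
    try {
      DateTimeFormatter.ofPattern(pattern, Locale.US)
      ()
    } catch {
      case e: IllegalArgumentException =>
        // Ask the legacy formatter whether it would have accepted the pattern.
        val legacyAccepts =
          try { new SimpleDateFormat(pattern, Locale.US); true }
          catch { case _: IllegalArgumentException => false }
        // Legacy accepts: behaviour changed, so hint at the upgrade.
        // Legacy rejects too: the pattern was never valid, keep the original error.
        if (legacyAccepts) throw new UpgradeHint(s"Pattern '$pattern' is only valid in legacy mode", e)
        else throw e
    }
}
```

// With plain JDK formatters, "aa" is rejected by DateTimeFormatter but accepted by
// SimpleDateFormat, so it produces the hint; "QQQQQQ" is rejected by both, so the
// original IllegalArgumentException is rethrown.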

private object DateTimeFormatterHelper {
Expand Down Expand Up @@ -190,6 +218,8 @@ private object DateTimeFormatterHelper {
}

final val unsupportedLetters = Set('A', 'c', 'e', 'n', 'N', 'p')
final val unsupportedNarrowTextStyle =
Set("GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "QQQQQ", "qqqqq")

/**
* In Spark 3.0, we switch to the Proleptic Gregorian calendar and use DateTimeFormatter for
Expand All @@ -211,6 +241,9 @@ private object DateTimeFormatterHelper {
for (c <- patternPart if unsupportedLetters.contains(c)) {
throw new IllegalArgumentException(s"Illegal pattern character: $c")
}
for (style <- unsupportedNarrowTextStyle if patternPart.contains(style)) {
throw new IllegalArgumentException(s"Too many pattern letters: ${style.head}")
}
// The meaning of 'u' was day number of week in SimpleDateFormat, it was changed to year
// in DateTimeFormatter. Substitute 'u' to 'e' and use DateTimeFormatter to parse the
// string. If parsable, return the result; otherwise, fall back to 'u', and then use the
Expand Down
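The narrow-style check in this file uses a substring test, so a run of six or more identical letters is rejected as well, because it contains the five-letter run. A simplified sketch of that idea, assuming a hypothetical helper `firstOffender` and ignoring the quoted-literal handling that the surrounding Spark code performs on each `patternPart`:

```scala
object NarrowStyleCheck {
  // Five-letter runs that would select the narrow text style.
  val unsupportedNarrowTextStyle: Set[String] =
    Set("GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "QQQQQ", "qqqqq")

  // Return the offending pattern letter, if any five-letter run occurs in the part.
  def firstOffender(patternPart: String): Option[Char] =
    unsupportedNarrowTextStyle
      .find(style => patternPart.contains(style))
      .map(_.head)

  // Mirror the error message format used in the diff.
  def check(patternPart: String): Unit =
    firstOffender(patternPart).foreach { c =>
      throw new IllegalArgumentException(s"Too many pattern letters: $c")
    }
}
```

For instance, `NarrowStyleCheck.firstOffender("dd MM yyyy EEEEEE")` reports `E`, while `"yyyy-MM-dd EEEE"` passes the check.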
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,7 @@ sealed trait TimestampFormatter extends Serializable {
def format(us: Long): String
def format(ts: Timestamp): String
def format(instant: Instant): String
def validatePatternString(): Unit
}

class Iso8601TimestampFormatter(
Expand Down Expand Up @@ -99,6 +100,12 @@ class Iso8601TimestampFormatter(
override def format(ts: Timestamp): String = {
legacyFormatter.format(ts)
}

override def validatePatternString(): Unit = {
try {
formatter
} catch checkLegacyFormatter(pattern, legacyFormatter.validatePatternString)
}
}

/**
Expand Down Expand Up @@ -202,6 +209,8 @@ class LegacyFastTimestampFormatter(
override def format(instant: Instant): String = {
format(instantToMicros(instant))
}

override def validatePatternString(): Unit = fastDateFormat
}

class LegacySimpleTimestampFormatter(
Expand Down Expand Up @@ -231,6 +240,8 @@ class LegacySimpleTimestampFormatter(
override def format(instant: Instant): String = {
format(instantToMicros(instant))
}

override def validatePatternString(): Unit = sdf
}

object LegacyDateFormats extends Enumeration {
Expand All @@ -255,8 +266,10 @@ object TimestampFormatter {
if (SQLConf.get.legacyTimeParserPolicy == LEGACY) {
getLegacyFormatter(pattern, zoneId, locale, legacyFormat)
} else {
new Iso8601TimestampFormatter(
val tf = new Iso8601TimestampFormatter(
pattern, zoneId, locale, legacyFormat, needVarLengthSecondFraction)
tf.validatePatternString()
tf
}
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -267,7 +267,7 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {

// Test escaping of format
GenerateUnsafeProjection.generate(
DateFormatClass(Literal(ts), Literal("\"quote"), JST_OPT) :: Nil)
DateFormatClass(Literal(ts), Literal("\""), JST_OPT) :: Nil)

// SPARK-28072 The codegen path should work
checkEvaluation(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@

package org.apache.spark.sql.catalyst.util

import org.apache.spark.SparkFunSuite
import org.apache.spark.{SparkFunSuite, SparkUpgradeException}
import org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper._

class DateTimeFormatterHelperSuite extends SparkFunSuite {
Expand All @@ -40,6 +40,16 @@ class DateTimeFormatterHelperSuite extends SparkFunSuite {
val e = intercept[IllegalArgumentException](convertIncompatiblePattern(s"yyyy-MM-dd $l G"))
assert(e.getMessage === s"Illegal pattern character: $l")
}
unsupportedNarrowTextStyle.foreach { style =>
val e1 = intercept[IllegalArgumentException] {
convertIncompatiblePattern(s"yyyy-MM-dd $style")
}
assert(e1.getMessage === s"Too many pattern letters: ${style.head}")
val e2 = intercept[IllegalArgumentException] {
convertIncompatiblePattern(s"yyyy-MM-dd $style${style.head}")
}
assert(e2.getMessage === s"Too many pattern letters: ${style.head}")
}
assert(convertIncompatiblePattern("yyyy-MM-dd uuuu") === "uuuu-MM-dd eeee")
assert(convertIncompatiblePattern("yyyy-MM-dd EEEE") === "uuuu-MM-dd EEEE")
assert(convertIncompatiblePattern("yyyy-MM-dd'e'HH:mm:ss") === "uuuu-MM-dd'e'HH:mm:ss")
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -396,4 +396,15 @@ class TimestampFormatterSuite extends SparkFunSuite with SQLHelper with Matchers
val micros = formatter.parse("2009 11")
assert(micros === date(2009, 1, 1, 11))
}

test("explicitly forbidden datetime patterns") {
// not supported by the legacy formatter either
Seq("QQQQQ", "qqqqq", "A", "c", "e", "n", "N", "p").foreach { pattern =>
intercept[IllegalArgumentException](TimestampFormatter(pattern, UTC).format(0))
}
// supported by the legacy formatter, so we raise SparkUpgradeException to guide users
Seq("GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa").foreach { pattern =>
intercept[SparkUpgradeException](TimestampFormatter(pattern, UTC).format(0))
}
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
--SET spark.sql.legacy.timeParserPolicy=LEGACY
--IMPORT datetime.sql
20 changes: 20 additions & 0 deletions sql/core/src/test/resources/sql-tests/inputs/datetime.sql
Original file line number Diff line number Diff line change
Expand Up @@ -140,3 +140,23 @@ select to_date("16", "dd");
select to_date("02-29", "MM-dd");
select to_timestamp("2019 40", "yyyy mm");
select to_timestamp("2019 10:10:10", "yyyy hh:mm:ss");

-- Unsupported narrow text style
select date_format(date '2020-05-23', 'GGGGG');
select date_format(date '2020-05-23', 'MMMMM');
select date_format(date '2020-05-23', 'LLLLL');
select date_format(timestamp '2020-05-23', 'EEEEE');
select date_format(timestamp '2020-05-23', 'uuuuu');
select date_format('2020-05-23', 'QQQQQ');
select date_format('2020-05-23', 'qqqqq');
select to_timestamp('2019-10-06 A', 'yyyy-MM-dd GGGGG');
select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEEE');
select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE');
select unix_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE');
select from_unixtime(12345, 'MMMMM');
select from_unixtime(54321, 'QQQQQ');
select from_unixtime(23456, 'aaaaa');
select from_json('{"time":"26/October/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'));
select from_json('{"date":"26/October/2015"}', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy'));
select from_csv('26/October/2015', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'));
select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy'));