@@ -175,10 +175,30 @@ class UnivocityParser(
}

case _: TimestampType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(timestampFormatter.parse)
nullSafeDatum(d, name, nullable, options) { datum =>
try {
timestampFormatter.parse(datum)
Member

Should we also guard this fallback behavior behind a legacy config? Or will JSON/CSV not keep the same behavior as the SQL side?
As it stands, this seems to break the rule we want to establish in #27537: throw an exception when the result changes between old and new Spark versions.

Contributor Author

I don't think it should be protected by a config.

The fallback was there from the very beginning without any config, and I think it's reasonable to always support the legacy format, to make the parser more lenient.
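
For illustration, a minimal sketch of the user-facing effect being discussed (assuming an active SparkSession named spark; this snippet is not part of the PR): a timestamp written in the lenient 2.x style still parses even though it does not match the new formatter's default pattern.

  import spark.implicits._

  // Single-digit month and minute fields are rejected by the new strict formatter,
  // so the added fallback parses them the way Spark 2.x did.
  val ds = Seq("2020-1-12 12:3:45").toDS()
  val df = spark.read.schema("t timestamp").csv(ds)
  df.show(false)  // expected: 2020-01-12 12:03:45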

Member

Copy, that makes sense.
So this fallback logic acts as a kind of guard for the parser, regardless of whether the parser is the new one or the legacy one.
After this is merged, #27537 will need to address the logic conflict.

} catch {
case NonFatal(e) =>
// If fails to parse, then tries the way used in 2.0 and 1.x for backwards
// compatibility.
val str = UTF8String.fromString(DateTimeUtils.cleanLegacyTimestampStr(datum))
DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
Member

Looking at what was removed there (https://github.com/apache/spark/pull/23150/files#diff-c82e4b74d2a51fed29069745ce4f9e96L164), DateTimeUtils.stringToTime and DateTimeUtils.stringToTimestamp are 2 different functions, see https://github.com/MaxGekk/spark/blob/60c0974261c947c0838457c40f4fe0e64d17ca15/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L173-L191

stringToTime has a few problems: it doesn't respect Spark's session time zone, and it parses in the hybrid (combined Julian + Gregorian) calendar. If you use it as a fallback, you can get values parsed in different calendars.
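
For context on the calendar point, an illustrative sketch (not code from this PR): the legacy java.sql / java.util classes use the hybrid Julian/Gregorian calendar, while java.time uses the proleptic Gregorian calendar, so the same string can resolve to different days for old dates.

  import java.time.LocalDate

  // 1582-10-10 falls in the ten days dropped at the Gregorian cutover, so the
  // hybrid-calendar classes may silently shift it, while java.time keeps it as-is.
  val hybrid = java.sql.Date.valueOf("1582-10-10")  // typically prints as 1582-10-20
  val proleptic = LocalDate.parse("1582-10-10")     // 1582-10-10
  println(hybrid)
  println(proleptic)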

Contributor Author

Yeah, I just want to support the legacy format, not to be exactly the same as 2.4. We switched the calendar, so we can't be exactly the same anyway.

Member
@dongjoon-hyun Feb 26, 2020

If we can't be exactly the same, why do we add this back?
cc @marmbrus and @gatorsmile

Contributor Author

We can't be exactly the same in 1% of cases, and this change makes us exactly the same in the other 99%.

Contributor

  • What cases used to be supported that are no longer supported?
  • What remains in the 1%?
  • Why can we not get to 100%?

To me, Spark suddenly failing to parse timestamps that it used to parse falls into the "extremely annoying" category of regressions. Working with timestamps is already hard and confusing. If my job just suddenly starts returning null when it used to work as expected, it's:

  • Probably going to take a while for people to figure out (while the null partition of their table mysteriously fills up).
  • Probably non-trivial for them to figure out how to fix. (I don't think constructing / testing format strings is a trivial activity.)

I really question why we changed these APIs rather than creating new, better ones. These changes broke my pipeline, and it took me hours plus access to a team of Spark committers to figure out why!

Contributor Author

Looks like a corner case, but let's add it back as well.

Member

@cloud-fan Here is a draft PR which supports any zone IDs in stringToTimestamp: #27753

Member
@dongjoon-hyun Mar 2, 2020

Thank you for keeping this situation moving (improving)!

Contributor Author

@MaxGekk that's a good complement to support more legacy formats!

}
}

case _: DateType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(dateFormatter.parse)
nullSafeDatum(d, name, nullable, options) { datum =>
try {
dateFormatter.parse(datum)
} catch {
case NonFatal(e) =>
// If fails to parse, then tries the way used in 2.0 and 1.x for backwards
// compatibility.
val str = UTF8String.fromString(DateTimeUtils.cleanLegacyTimestampStr(datum))
DateTimeUtils.stringToDate(str, options.zoneId).getOrElse(throw e)
}
}

case _: StringType => (d: String) =>
nullSafeDatum(d, name, nullable, options)(UTF8String.fromString)

@@ -21,7 +21,6 @@ import java.io.{ByteArrayOutputStream, CharConversionException}
import java.nio.charset.MalformedInputException

import scala.collection.mutable.ArrayBuffer
import scala.util.Try
import scala.util.control.NonFatal

import com.fasterxml.jackson.core._

@@ -230,7 +229,15 @@ class JacksonParser(
case TimestampType =>
(parser: JsonParser) => parseJsonToken[java.lang.Long](parser, dataType) {
case VALUE_STRING if parser.getTextLength >= 1 =>
timestampFormatter.parse(parser.getText)
try {
timestampFormatter.parse(parser.getText)
} catch {
case NonFatal(e) =>
// If fails to parse, then tries the way used in 2.0 and 1.x for backwards
// compatibility.
val str = UTF8String.fromString(DateTimeUtils.cleanLegacyTimestampStr(parser.getText))
DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
}

case VALUE_NUMBER_INT =>
parser.getLongValue * 1000000L

@@ -239,7 +246,23 @@
case DateType =>
(parser: JsonParser) => parseJsonToken[java.lang.Integer](parser, dataType) {
case VALUE_STRING if parser.getTextLength >= 1 =>
dateFormatter.parse(parser.getText)
try {
dateFormatter.parse(parser.getText)
} catch {
case NonFatal(e) =>
// If fails to parse, then tries the way used in 2.0 and 1.x for backwards
// compatibility.
val str = UTF8String.fromString(DateTimeUtils.cleanLegacyTimestampStr(parser.getText))
DateTimeUtils.stringToDate(str, options.zoneId).getOrElse {
// In Spark 1.5.0, we store the data as number of days since epoch in string.
// So, we just convert it to Int.
try {
parser.getText.toInt
Member

Should we rebase this?

Contributor Author

Good catch! I think we should.

Member

Here is the PR #28453

} catch {
case _: NumberFormatException => throw e
}
}.asInstanceOf[Integer]
}
}

case BinaryType =>

@@ -163,6 +163,21 @@ object DateTimeUtils {
instantToMicros(localDateTime.atZone(zoneId).toInstant)
}

// A method called by JSON/CSV parser to clean up the legacy timestamp string by removing the
// "GMT" string.
def cleanLegacyTimestampStr(s: String): String = {
val indexOfGMT = s.indexOf("GMT")
if (indexOfGMT != -1) {
// ISO8601 with a weird time zone specifier (2000-01-01T00:00GMT+01:00)
val s0 = s.substring(0, indexOfGMT)
val s1 = s.substring(indexOfGMT + 3)
Member

There are 3 cases of GMT formatting: GMT+8, GMT+08:00, and plain GMT. Do we need to check all of them in UnivocityParserSuite?

Contributor Author

This will be fixed by #27753

Member

Oh yeah, got it, thanks.

// Mapped to 2000-01-01T00:00+01:00
s0 + s1
} else {
s
}
}
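
// Illustrative usage only (not part of this patch): the expected effect of
// cleanLegacyTimestampStr on the GMT variants discussed above, e.g. from a test
// or a spark-shell session with this object in scope:
//   cleanLegacyTimestampStr("2000-01-01T00:00GMT+01:00")  // "2000-01-01T00:00+01:00"
//   cleanLegacyTimestampStr("2020-1-12 12:3:45GMT")       // "2020-1-12 12:3:45"
//   cleanLegacyTimestampStr("2015-01-01")                 // no "GMT", returned unchanged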

/**
* Trim and parse a given UTF8 date string to the corresponding [[Long]] value. The following
* The return type is [[Option]] in order to distinguish between 0L and null. The following

@@ -28,6 +28,7 @@ import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.plans.SQLHelper
import org.apache.spark.sql.catalyst.util.DateTimeConstants._
import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.sources.{EqualTo, Filter, StringStartsWith}
import org.apache.spark.sql.types._

@@ -318,4 +319,44 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
}.getMessage
assert(errMsg2.contains("i does not exist"))
}

test("SPARK-30960: parse date/timestamp string with legacy format") {
Contributor Author

Unfortunately there is no UT for the JSON parser. The testability of the JSON parser is not good...

def check(parser: UnivocityParser): Unit = {
// The legacy format allows 1 or 2 chars for some fields.
assert(parser.makeConverter("t", TimestampType).apply("2020-1-12 12:3:45") ==
date(2020, 1, 12, 12, 3, 45, 0))
assert(parser.makeConverter("t", DateType).apply("2020-1-12") ==
days(2020, 1, 12, 0, 0, 0))
// The legacy format allows a second fraction of arbitrary length.
assert(parser.makeConverter("t", TimestampType).apply("2020-1-12 12:3:45.1") ==
date(2020, 1, 12, 12, 3, 45, 100000))
assert(parser.makeConverter("t", TimestampType).apply("2020-1-12 12:3:45.1234") ==
date(2020, 1, 12, 12, 3, 45, 123400))
// The legacy format allows the date string to end with T or a space, followed by an arbitrary string.
assert(parser.makeConverter("t", DateType).apply("2020-1-12T") ==
days(2020, 1, 12, 0, 0, 0))
assert(parser.makeConverter("t", DateType).apply("2020-1-12Txyz") ==
days(2020, 1, 12, 0, 0, 0))
assert(parser.makeConverter("t", DateType).apply("2020-1-12 ") ==
days(2020, 1, 12, 0, 0, 0))
assert(parser.makeConverter("t", DateType).apply("2020-1-12 xyz") ==
days(2020, 1, 12, 0, 0, 0))
// The legacy format ignores the "GMT" string anywhere in the input.
assert(parser.makeConverter("t", TimestampType).apply("2020-1-12 12:3:45GMT") ==
date(2020, 1, 12, 12, 3, 45, 0))
assert(parser.makeConverter("t", TimestampType).apply("GMT2020-1-12 12:3:45") ==
date(2020, 1, 12, 12, 3, 45, 0))
assert(parser.makeConverter("t", DateType).apply("2020-1-12GMT") ==
days(2020, 1, 12, 0, 0, 0))
assert(parser.makeConverter("t", DateType).apply("GMT2020-1-12") ==
days(2020, 1, 12, 0, 0, 0))
}

val options = new CSVOptions(Map.empty[String, String], false, "UTC")
check(new UnivocityParser(StructType(Seq.empty), options))

val optionsWithPattern =
new CSVOptions(Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, "UTC")
check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
}
}

@@ -2301,6 +2301,12 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
assert(csv.schema.fieldNames === Seq("a", "b", "0"))
checkAnswer(csv, Row("a", "b", 1))
}

test("SPARK-30960: parse date/timestamp string with legacy format") {
val ds = Seq("2020-1-12 3:23:34.12, 2020-1-12 T").toDS()
val csv = spark.read.option("header", false).schema("t timestamp, d date").csv(ds)
checkAnswer(csv, Row(Timestamp.valueOf("2020-1-12 3:23:34.12"), Date.valueOf("2020-1-12")))
}
}

class CSVv1Suite extends CSVSuite {

@@ -21,6 +21,7 @@ import java.io._
import java.nio.charset.{Charset, StandardCharsets, UnsupportedCharsetException}
import java.nio.file.Files
import java.sql.{Date, Timestamp}
import java.time.{LocalDate, LocalDateTime, ZoneId}
import java.util.Locale

import com.fasterxml.jackson.core.JsonFactory

@@ -39,6 +40,7 @@ import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.StructType.fromDDL
import org.apache.spark.sql.types.TestUDT.{MyDenseVector, MyDenseVectorUDT}
import org.apache.spark.util.Utils

class TestFileFilter extends PathFilter {

@@ -1447,6 +1449,107 @@ abstract class JsonSuite extends QueryTest with SharedSparkSession with TestJson
})
}

test("backward compatibility") {
Contributor Author

This test case was removed in 3.0. Now we add it back.

// In this test we make sure our JSON support can read JSON data generated by previous versions
// of Spark through the toJSON method and the JSON data source.
// The data is generated by the following program.
// Here are a few notes:
// - Spark 1.5.0 cannot save timestamp data. So, we manually added timestamp field (col13)
// in the JSON object.
// - For Spark before 1.5.1, we do not generate UDTs. So, we manually added the UDT value to
// JSON objects generated by those Spark versions (col17).
// - If the type is NullType, we do not write data out.

// Create the schema.
val struct =
StructType(
StructField("f1", FloatType, true) ::
StructField("f2", ArrayType(BooleanType), true) :: Nil)

val dataTypes =
Seq(
StringType, BinaryType, NullType, BooleanType,
ByteType, ShortType, IntegerType, LongType,
FloatType, DoubleType, DecimalType(25, 5), DecimalType(6, 5),
DateType, TimestampType,
ArrayType(IntegerType), MapType(StringType, LongType), struct,
new MyDenseVectorUDT())
val fields = dataTypes.zipWithIndex.map { case (dataType, index) =>
StructField(s"col$index", dataType, nullable = true)
}
val schema = StructType(fields)

val constantValues =
Seq(
"a string in binary".getBytes(StandardCharsets.UTF_8),
null,
true,
1.toByte,
2.toShort,
3,
Long.MaxValue,
0.25.toFloat,
0.75,
new java.math.BigDecimal(s"1234.23456"),
new java.math.BigDecimal(s"1.23456"),
java.sql.Date.valueOf("2015-01-01"),
java.sql.Timestamp.valueOf("2015-01-01 23:50:59.123"),
Seq(2, 3, 4),
Map("a string" -> 2000L),
Row(4.75.toFloat, Seq(false, true)),
new MyDenseVector(Array(0.25, 2.25, 4.25)))
val data =
Row.fromSeq(Seq("Spark " + spark.sparkContext.version) ++ constantValues) :: Nil

// Data generated by previous versions.
// scalastyle:off
val existingJSONData =
"""{"col0":"Spark 1.2.2","col1":"YSBzdHJpbmcgaW4gYmluYXJ5","col3":true,"col4":1,"col5":2,"col6":3,"col7":9223372036854775807,"col8":0.25,"col9":0.75,"col10":1234.23456,"col11":1.23456,"col12":"2015-01-01","col13":"2015-01-01 23:50:59.123","col14":[2,3,4],"col15":{"a string":2000},"col16":{"f1":4.75,"f2":[false,true]},"col17":[0.25,2.25,4.25]}""" ::
"""{"col0":"Spark 1.3.1","col1":"YSBzdHJpbmcgaW4gYmluYXJ5","col3":true,"col4":1,"col5":2,"col6":3,"col7":9223372036854775807,"col8":0.25,"col9":0.75,"col10":1234.23456,"col11":1.23456,"col12":"2015-01-01","col13":"2015-01-01 23:50:59.123","col14":[2,3,4],"col15":{"a string":2000},"col16":{"f1":4.75,"f2":[false,true]},"col17":[0.25,2.25,4.25]}""" ::
"""{"col0":"Spark 1.3.1","col1":"YSBzdHJpbmcgaW4gYmluYXJ5","col3":true,"col4":1,"col5":2,"col6":3,"col7":9223372036854775807,"col8":0.25,"col9":0.75,"col10":1234.23456,"col11":1.23456,"col12":"2015-01-01","col13":"2015-01-01 23:50:59.123","col14":[2,3,4],"col15":{"a string":2000},"col16":{"f1":4.75,"f2":[false,true]},"col17":[0.25,2.25,4.25]}""" ::
"""{"col0":"Spark 1.4.1","col1":"YSBzdHJpbmcgaW4gYmluYXJ5","col3":true,"col4":1,"col5":2,"col6":3,"col7":9223372036854775807,"col8":0.25,"col9":0.75,"col10":1234.23456,"col11":1.23456,"col12":"2015-01-01","col13":"2015-01-01 23:50:59.123","col14":[2,3,4],"col15":{"a string":2000},"col16":{"f1":4.75,"f2":[false,true]},"col17":[0.25,2.25,4.25]}""" ::
"""{"col0":"Spark 1.4.1","col1":"YSBzdHJpbmcgaW4gYmluYXJ5","col3":true,"col4":1,"col5":2,"col6":3,"col7":9223372036854775807,"col8":0.25,"col9":0.75,"col10":1234.23456,"col11":1.23456,"col12":"2015-01-01","col13":"2015-01-01 23:50:59.123","col14":[2,3,4],"col15":{"a string":2000},"col16":{"f1":4.75,"f2":[false,true]},"col17":[0.25,2.25,4.25]}""" ::
"""{"col0":"Spark 1.5.0","col1":"YSBzdHJpbmcgaW4gYmluYXJ5","col3":true,"col4":1,"col5":2,"col6":3,"col7":9223372036854775807,"col8":0.25,"col9":0.75,"col10":1234.23456,"col11":1.23456,"col12":"2015-01-01","col13":"2015-01-01 23:50:59.123","col14":[2,3,4],"col15":{"a string":2000},"col16":{"f1":4.75,"f2":[false,true]},"col17":[0.25,2.25,4.25]}""" ::
"""{"col0":"Spark 1.5.0","col1":"YSBzdHJpbmcgaW4gYmluYXJ5","col3":true,"col4":1,"col5":2,"col6":3,"col7":9223372036854775807,"col8":0.25,"col9":0.75,"col10":1234.23456,"col11":1.23456,"col12":"16436","col13":"2015-01-01 23:50:59.123","col14":[2,3,4],"col15":{"a string":2000},"col16":{"f1":4.75,"f2":[false,true]},"col17":[0.25,2.25,4.25]}""" :: Nil
// scalastyle:on

// Generate data for the current version.
val df = spark.createDataFrame(spark.sparkContext.parallelize(data, 1), schema)
withTempPath { path =>
df.write.format("json").mode("overwrite").save(path.getCanonicalPath)

// df.toJSON will convert internal rows to external rows first and then generate
// JSON objects. While, df.write.format("json") will write internal rows directly.
val allJSON =
existingJSONData ++
df.toJSON.collect() ++
sparkContext.textFile(path.getCanonicalPath).collect()

Utils.deleteRecursively(path)
sparkContext.parallelize(allJSON, 1).saveAsTextFile(path.getCanonicalPath)

// Read data back with the schema specified.
val col0Values =
Seq(
"Spark 1.2.2",
"Spark 1.3.1",
"Spark 1.3.1",
"Spark 1.4.1",
"Spark 1.4.1",
"Spark 1.5.0",
"Spark 1.5.0",
"Spark " + spark.sparkContext.version,
"Spark " + spark.sparkContext.version)
val expectedResult = col0Values.map { v =>
Row.fromSeq(Seq(v) ++ constantValues)
}
checkAnswer(
spark.read.format("json").schema(schema).load(path.getCanonicalPath),
expectedResult
)
}
}

test("SPARK-11544 test pathfilter") {
withTempPath { dir =>
val path = dir.getCanonicalPath

@@ -2557,6 +2660,15 @@ abstract class JsonSuite extends QueryTest with SharedSparkSession with TestJson
checkAnswer(readBack, timestampsWithFormat)
}
}

test("SPARK-30960: parse date/timestamp string with legacy format") {
val ds = Seq("{'t': '2020-1-12 3:23:34.12', 'd': '2020-1-12 T', 'd2': '12345'}").toDS()
val json = spark.read.schema("t timestamp, d date, d2 date").json(ds)
checkAnswer(json, Row(
Timestamp.valueOf("2020-1-12 3:23:34.12"),
Date.valueOf("2020-1-12"),
Date.valueOf(LocalDate.ofEpochDay(12345))))
}
}

class JsonV1Suite extends JsonSuite {