[SPARK-24773] Avro: support logical timestamp type with different precisions #21935
gengliangwang wants to merge 8 commits into apache:master from
Conversation
@gengliangwang, thanks! I am a bot who has found some folks who might be able to help with the review: @cloud-fan, @gatorsmile and @HyukjinKwon

Test build #93838 has finished for PR 21935 at commit
```diff
  case DateType => builder.longType()
- case TimestampType => builder.longType()
+ case TimestampType =>
+   // To be consistent with the previous behavior of writing Timestamp type with Avro 1.7,
```
Isn't the previous behavior that we can't write out timestamp data at all?
Also, we should follow Parquet and have a config `spark.sql.avro.outputTimestampType` to control it.
Previously we wrote the timestamp as Long and divided the value by 1000 (millisecond precision).
Maybe I need to revise the comment.
+1 on the new config.
For now I think writing out timestamp micros should be good
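The precision handling under discussion reduces to simple scaling between Spark's internal microsecond representation and Avro's millisecond logical type. A self-contained sketch (the helper names are illustrative, not from the PR):

```scala
// Spark stores TimestampType internally as microseconds since the Unix epoch.
// Writing Avro timestamp-millis divides by 1000 (truncating sub-millisecond
// detail); reading timestamp-millis multiplies back up to microseconds.
def microsToMillis(micros: Long): Long = micros / 1000 // write path
def millisToMicros(millis: Long): Long = millis * 1000 // read path

val micros = 1533081600123456L // an instant with microsecond detail
// The last three digits are lost after a millis roundtrip.
assert(millisToMicros(microsToMillis(micros)) == 1533081600123000L)
```

This is why defaulting to `TIMESTAMP_MICROS` is lossless for Spark data, while `TIMESTAMP_MILLIS` truncates.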
```scala
case TimestampType =>
  // To be consistent with the previous behavior of writing Timestamp type with Avro 1.7,
  // the default output Avro Timestamp type is with millisecond precision.
  builder.longBuilder().prop(LogicalType.LOGICAL_TYPE_PROP, "timestamp-millis").endLong()
```
Is there a better API for it? Hardcoding a string is hacky.
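Avro (1.8+) does expose a typed factory for this through `LogicalTypes`, which avoids the raw string. A sketch, assuming the Avro library is on the classpath:

```scala
import org.apache.avro.{LogicalTypes, Schema}

// Attach the logical type via Avro's LogicalTypes factory methods instead of
// setting the "timestamp-millis" property string by hand.
val millisSchema: Schema =
  LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG))
val microsSchema: Schema =
  LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG))
```

`addToSchema` also validates that the logical type is compatible with the underlying Avro type, which the raw property string does not.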
Test build #93859 has finished for PR 21935 at commit
```diff
  catalystType: DataType,
- path: List[String]): (CatalystDataUpdater, Int, Any) => Unit =
+ path: List[String]): (CatalystDataUpdater, Int, Any) => Unit = {
+   (avroType.getLogicalType, catalystType) match {
```
Can we do this like:

```scala
case (LONG, TimestampType) => avroType.getLogicalType match {
  case _: TimestampMillis => (updater, ordinal, value) =>
    updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
  case _: TimestampMicros => (updater, ordinal, value) =>
    updater.setLong(ordinal, value.asInstanceOf[Long])
  case _ => (updater, ordinal, value) =>
    updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
}
```

They have the Avro long type anyway, so this is easier to read, and actually safer and more correct.
```scala
 * This function takes an avro schema and returns a sql schema.
 */
def toSqlType(avroSchema: Schema): SchemaType = {
  avroSchema.getLogicalType match {
```
```scala
case _: TimestampMicros => (updater, ordinal, value) =>
  updater.setLong(ordinal, value.asInstanceOf[Long])
case _ => (updater, ordinal, value) =>
  updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
```
Let's add a comment saying this is for backward compatibility. Also, we should only do it when the logical type is null; for other logical types, we should fail here.
```scala
(getter, ordinal) => avroType.getLogicalType match {
  case _: TimestampMillis => getter.getLong(ordinal) / 1000
  case _: TimestampMicros => getter.getLong(ordinal)
  case _ => getter.getLong(ordinal)
```
```diff
- case LONG => SchemaType(LongType, nullable = false)
+ case LONG => avroSchema.getLogicalType match {
+   case _: TimestampMillis | _: TimestampMicros =>
+     return SchemaType(TimestampType, nullable = false)
```
```diff
- case TimestampType => builder.longType()
+ case TimestampType =>
+   val timestampType = outputTimestampType match {
+     case "TIMESTAMP_MILLIS" => LogicalTypes.timestampMillis()
```
Don't hardcode the strings; we can write:

```scala
if (outputTimestampType == AvroOutputTimestampType.TIMESTAMP_MICROS.toString) ...
```
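A self-contained sketch of dispatching on enumeration values rather than on raw strings; the enumeration here is a stand-in mirroring the PR's `AvroOutputTimestampType`:

```scala
// Stand-in for the PR's AvroOutputTimestampType enumeration.
object AvroOutputTimestampType extends Enumeration {
  val TIMESTAMP_MICROS, TIMESTAMP_MILLIS = Value
}

// Matching on enumeration values avoids typo-prone raw string comparisons.
def logicalTypeName(t: AvroOutputTimestampType.Value): String = t match {
  case AvroOutputTimestampType.TIMESTAMP_MILLIS => "timestamp-millis"
  case AvroOutputTimestampType.TIMESTAMP_MICROS => "timestamp-micros"
}
```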
|
Test build #93884 has finished for PR 21935 at commit
|
```diff
- prevNameSpace: String = ""): Schema = {
+ prevNameSpace: String = "",
+ outputTimestampType: AvroOutputTimestampType.Value = AvroOutputTimestampType.TIMESTAMP_MICROS
+ ): Schema = {
```
Not sure if the indent here is correct.
I believe

```scala
    outputTimestampType: AvroOutputTimestampType.Value = AvroOutputTimestampType.TIMESTAMP_MICROS)
  : Schema = {
```

is more correct per https://github.com/databricks/scala-style-guide#spacing-and-indentation
Test build #93899 has finished for PR 21935 at commit

retest this please
```scala
 * from the Unix epoch. TIMESTAMP_MILLIS is also logical, but with millisecond precision,
 * which means Spark has to truncate the microsecond portion of its timestamp value.
 */
val outputTimestampType: AvroOutputTimestampType.Value = {
```
Hm, I wouldn't expose this as an option for now; that at least matches Parquet's behavior.
I'm OK with it; I think Parquet should also follow this.
```diff
  import org.apache.spark.sql.internal.SQLConf
  import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
- import org.apache.spark.sql.types._
+ import org.apache.spark.sql.types.{StructType, _}
```
```scala
  // For backward compatibility, if the Avro type is Long and it is not logical type,
  // the value is processed as timestamp type with millisecond precision.
  updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
}
```
We should add a default case and throw IncompatibleSchemaException, in case Avro adds more logical types for the long type in the future.
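A self-contained sketch of that defensive default case; the string names stand in for Avro's logical-type classes, and the exception is a minimal stand-in for spark-avro's `IncompatibleSchemaException`:

```scala
// Minimal stand-in for spark-avro's IncompatibleSchemaException.
class IncompatibleSchemaException(msg: String) extends Exception(msg)

// Resolve a reader for long-backed timestamps; unknown logical types fail
// fast instead of being silently misread as millisecond timestamps.
def timestampReader(logicalTypeName: String): Long => Long = logicalTypeName match {
  case "timestamp-millis" => millis => millis * 1000 // scale up to micros
  case "timestamp-micros" => micros => micros
  case null => millis => millis * 1000 // backward compatibility: plain long
  case other =>
    throw new IncompatibleSchemaException(s"Unexpected logical type: $other")
}
```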
```diff
  (getter, ordinal) => getter.getInt(ordinal) * DateTimeUtils.MILLIS_PER_DAY
  case TimestampType =>
-   (getter, ordinal) => getter.getLong(ordinal) / 1000
+   (getter, ordinal) => avroType.getLogicalType match {
```
Do not do the pattern match per record; we should write:

```scala
avroType.getLogicalType match {
  case _: TimestampMillis => (getter, ordinal) => ...
```
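The point is to resolve the logical type once, when the converter is built, so the returned closure does no matching on the per-record hot path. A self-contained sketch with simplified stand-in types:

```scala
// The match runs once at converter-construction time; the returned closure
// is match-free when invoked for each record.
def makeTimestampWriter(logicalTypeName: String): Long => Long =
  logicalTypeName match {
    case "timestamp-millis" => micros => micros / 1000
    case _ => micros => micros
  }

val toMillis = makeTimestampWriter("timestamp-millis")
```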
```scala
case _: TimestampMicros => getter.getLong(ordinal)
// For backward compatibility, if the Avro type is Long and it is not logical type,
// output the timestamp value as with millisecond precision.
case null => getter.getLong(ordinal) / 1000
```
ditto, add a default case.
```diff
  recordName: String = "topLevelRecord",
- prevNameSpace: String = ""): Schema = {
+ prevNameSpace: String = "",
+ outputTimestampType: AvroOutputTimestampType.Value = AvroOutputTimestampType.TIMESTAMP_MICROS
```
Do we really need the default value? There seems to be only one call site, excluding the recursive ones.
It is also used in CatalystDataToAvro
```scala
    updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
  case _: TimestampMicros => (updater, ordinal, value) =>
    updater.setLong(ordinal, value.asInstanceOf[Long])
  case null => (updater, ordinal, value) =>
```
Test build #93920 has finished for PR 21935 at commit
```scala
class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
  val episodesAvro = testFile("episodes.avro")
  val testAvro = testFile("test.avro")
  val timestampAvro = testFile("timestamp.avro")
```
At least we should document how the binary file is generated, or just do a roundtrip test: Spark writes Avro files and then reads them back.
The schema and data are stated in https://github.com/apache/spark/pull/21935/files#diff-9364b0610f92b3cc35a4bc43a80751bfR397
It should be easy to reproduce from the test cases.
The other test file episodesAvro also doesn't document how it was generated.
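A hedged sketch of the roundtrip test the reviewer suggests, in Spark's test-suite style; `withTempPath`, `spark`, and `checkAnswer` are assumed to come from the surrounding `SharedSQLContext`/`SQLTestUtils` harness, so this is not self-contained:

```scala
// Sketch: write-then-read roundtrip, removing the need for an opaque
// checked-in binary fixture. Helper methods come from the test harness.
test("roundtrip Avro timestamp logical types") {
  withTempPath { dir =>
    import spark.implicits._
    val df = Seq(java.sql.Timestamp.valueOf("2018-08-01 01:02:03.123456")).toDF("ts")
    df.write.format("avro").save(dir.getCanonicalPath)
    val readBack = spark.read.format("avro").load(dir.getCanonicalPath)
    checkAnswer(readBack, df)
  }
}
```

This is essentially what the follow-up PR apache#22091 ended up doing by generating the binary files inside the suite.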
Test build #93947 has finished for PR 21935 at commit

Test build #93959 has finished for PR 21935 at commit

retest this please.

LGTM

Test build #93985 has finished for PR 21935 at commit

Test build #93998 has finished for PR 21935 at commit

Test build #94004 has finished for PR 21935 at commit

retest this please

Test build #94020 has finished for PR 21935 at commit

retest this please

Test build #94052 has finished for PR 21935 at commit

Merged to master.
In PR apache#21984 and apache#21935, the related test cases use binary files created by Python scripts. Generate the binary files in the test suite instead, to make it more transparent. Also move the related test cases to a new file `AvroLogicalTypeSuite.scala`. Tested with unit tests. Closes apache#22091 from gengliangwang/logicalType_suite. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

Support reading/writing Avro logical timestamp type with different precisions
https://avro.apache.org/docs/1.8.2/spec.html#Timestamp+%28millisecond+precision%29

To specify the output timestamp type, use the Dataframe option `outputTimestampType` or the SQL config `spark.sql.avro.outputTimestampType`. The supported values are `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS`. The default output type is `TIMESTAMP_MICROS`.

How was this patch tested?

Unit test
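A hedged usage sketch of the feature described above; the DataFrame `df` and the output path are placeholders, and the option and config names follow this PR's description:

```scala
// Write with millisecond-precision Avro timestamps via the per-write option.
df.write
  .format("avro")
  .option("outputTimestampType", "TIMESTAMP_MILLIS")
  .save("/tmp/avro_out")

// Or set the default globally through the SQL config.
spark.conf.set("spark.sql.avro.outputTimestampType", "TIMESTAMP_MILLIS")
```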