[SPARK-24718][SQL] Timestamp support pushdown to parquet data source #21741
Conversation
| "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.") |
| .internal() |
| .booleanConf |
| .createWithDefault(true) |
Maybe the default should be false, because PARQUET_OUTPUT_TIMESTAMP_TYPE defaults to INT96.
Because we're using the file schema, it doesn't matter what the write configuration is; it only matters what it was when the file was written. If the file has an INT96 timestamp, this should just not push anything down.
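To make the rule above concrete, here is a toy sketch in plain Python (not Spark's actual ParquetFilters code; the function name and string constants are hypothetical): eligibility is decided from the timestamp type recorded in the file being read, never from the session's write configuration.

```python
# Hypothetical model of the pushdown eligibility check discussed above.
# Spark's real implementation is in Scala; this only illustrates the rule
# that INT96 timestamps in the file disable timestamp pushdown.
def can_push_down_timestamp(file_timestamp_type: str) -> bool:
    """file_timestamp_type is the type recorded in the Parquet file's
    schema, e.g. "INT96", "TIMESTAMP_MILLIS", or "TIMESTAMP_MICROS"."""
    return file_timestamp_type in ("TIMESTAMP_MILLIS", "TIMESTAMP_MICROS")
```

Under this model, a file written with INT96 timestamps gets no timestamp pushdown regardless of the current value of PARQUET_OUTPUT_TIMESTAMP_TYPE.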
Test build #92796 has finished for PR 21741 at commit

retest this please

Test build #92799 has finished for PR 21741 at commit
| val data = Seq(ts1, ts2, ts3, ts4) |
| withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> |
This case is quite similar to the one below. Should we use a loop for setting SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key to avoid duplicated code?
I changed it to:

// spark.sql.parquet.outputTimestampType = TIMESTAMP_MILLIS
val millisData = Seq(Timestamp.valueOf("2018-06-14 08:28:53.123"),
  Timestamp.valueOf("2018-06-15 08:28:53.123"),
  Timestamp.valueOf("2018-06-16 08:28:53.123"),
  Timestamp.valueOf("2018-06-17 08:28:53.123"))
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
    ParquetOutputTimestampType.TIMESTAMP_MILLIS.toString) {
  testTimestampPushdown(millisData)
}

// spark.sql.parquet.outputTimestampType = TIMESTAMP_MICROS
val microsData = Seq(Timestamp.valueOf("2018-06-14 08:28:53.123456"),
  Timestamp.valueOf("2018-06-15 08:28:53.123456"),
  Timestamp.valueOf("2018-06-16 08:28:53.123456"),
  Timestamp.valueOf("2018-06-17 08:28:53.123456"))
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
    ParquetOutputTimestampType.TIMESTAMP_MICROS.toString) {
  testTimestampPushdown(microsData)
}

We shouldn't use the same data to test the TIMESTAMP_MILLIS and TIMESTAMP_MICROS types:
- TIMESTAMP_MILLIS will truncate the trailing 456 if we use microsData for the test.
- It can't exercise DateTimeUtils.fromJavaTimestamp(t.asInstanceOf[Timestamp]) if we use millisData.
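The truncation argument can be checked outside Spark. A small Python sketch (the helper names are mine, not from the PR) shows why a `.123456` value does not survive a millisecond representation:

```python
from datetime import datetime, timezone

def to_micros(ts: datetime) -> int:
    # Microseconds since the epoch, as TIMESTAMP_MICROS would store the value.
    return int(ts.replace(tzinfo=timezone.utc).timestamp()) * 1_000_000 + ts.microsecond

def to_millis(micros: int) -> int:
    # TIMESTAMP_MILLIS keeps only whole milliseconds: integer division
    # truncates the sub-millisecond part.
    return micros // 1000

ts = datetime(2018, 6, 14, 8, 28, 53, 123456)
micros = to_micros(ts)
# Round-tripping through milliseconds loses the trailing "456" microseconds,
# which is why microsData would not round-trip through a TIMESTAMP_MILLIS write.
assert to_millis(micros) * 1000 != micros
```

Hence the test uses `.123`-precision values for TIMESTAMP_MILLIS and `.123456`-precision values for TIMESTAMP_MICROS.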
| } |
| } |
| test("filter pushdown - timestamp(TIMESTAMP_MILLIS)") { |
I think we should also test the INT96 timestamp type, and maybe also the case where PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED is disabled.
Test build #92908 has finished for PR 21741 at commit
gengliangwang
left a comment
LGTM
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 |
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz |
| Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative |
shall we add a new line after the benchmark name? e.g.
Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)):
Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
...
We can send a follow-up PR to fix this entire file.
OK. I'll send a follow-up PR.
LGTM
| buildConf("spark.sql.parquet.filterPushdown.timestamp") |
| .doc("If true, enables Parquet filter push-down optimization for Timestamp. " + |
| "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " + |
| "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.") |
Shall we note INT64 here?
I think end users have a better understanding of TIMESTAMP_MICROS and TIMESTAMP_MILLIS.
... I don't think ordinary users will understand any of them ...
You need to explain how to use spark.sql.parquet.outputTimestampType to control the Parquet timestamp type Spark uses to write Parquet files.
I would just note that push-down doesn't work for INT96 timestamps in the file. It should work for the others.
| } |
| } |
nit: I would revert this change if you are going to push more changes.
OK
HyukjinKwon
left a comment
LGTM too
+1

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

Test build #93006 has finished for PR 21741 at commit

Merged to master.
What changes were proposed in this pull request?

Timestamp support pushdown to the Parquet data source. Only TIMESTAMP_MICROS and TIMESTAMP_MILLIS support push-down.

How was this patch tested?

Unit tests and benchmark tests.
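As a hedged illustration of how the two configs in this PR interact (a plain-Python toy model, not Spark code; the key names come from the diff above, and the defaults are assumed from it), the new timestamp flag only takes effect when the parent Parquet pushdown flag is also enabled:

```python
# Toy model of the conf interplay described in the PR: timestamp pushdown
# requires both spark.sql.parquet.filterPushdown and the new
# spark.sql.parquet.filterPushdown.timestamp flag to be "true".
DEFAULTS = {
    "spark.sql.parquet.filterPushdown": "true",
    "spark.sql.parquet.filterPushdown.timestamp": "true",  # default per the diff
}

def timestamp_pushdown_active(conf: dict) -> bool:
    merged = {**DEFAULTS, **conf}
    return (merged["spark.sql.parquet.filterPushdown"] == "true"
            and merged["spark.sql.parquet.filterPushdown.timestamp"] == "true")
```

Disabling either flag (or reading a file that stores timestamps as INT96) leaves timestamp filters evaluated by Spark rather than the Parquet reader.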