[SPARK-24718][SQL] Timestamp support pushdown to parquet data source #21741
Conversation
| "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.") |
| .internal() |
| .booleanConf |
| .createWithDefault(true) |
Maybe the default should be false, because PARQUET_OUTPUT_TIMESTAMP_TYPE defaults to INT96.
Because we're using the file schema, it doesn't matter what the write configuration is; it only matters what it was when the file was written. If the file has an INT96 timestamp, this should just not push anything down.
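To make the rule above concrete, here is a toy sketch in plain Python (not Spark's actual ParquetFilters code; the function name and string constants are hypothetical): eligibility is decided from the timestamp type recorded in the file being read, never from the session's write configuration.

```python
# Hypothetical model of the pushdown eligibility check discussed above.
# Spark's real implementation is in Scala; this only illustrates the rule
# that INT96 timestamps in the file disable timestamp pushdown.
def can_push_down_timestamp(file_timestamp_type: str) -> bool:
    """file_timestamp_type is the type recorded in the Parquet file's
    schema, e.g. "INT96", "TIMESTAMP_MILLIS", or "TIMESTAMP_MICROS"."""
    return file_timestamp_type in ("TIMESTAMP_MILLIS", "TIMESTAMP_MICROS")
```

Under this model, a file written with INT96 timestamps gets no timestamp pushdown regardless of the current value of PARQUET_OUTPUT_TIMESTAMP_TYPE.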
Test build #92796 has finished for PR 21741 at commit

retest this please

Test build #92799 has finished for PR 21741 at commit
| val data = Seq(ts1, ts2, ts3, ts4) |
| withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> |
This case is quite similar to the one below. Should we use a loop for setting SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key to avoid duplicated code?
I changed it to:

// spark.sql.parquet.outputTimestampType = TIMESTAMP_MILLIS
val millisData = Seq(Timestamp.valueOf("2018-06-14 08:28:53.123"),
  Timestamp.valueOf("2018-06-15 08:28:53.123"),
  Timestamp.valueOf("2018-06-16 08:28:53.123"),
  Timestamp.valueOf("2018-06-17 08:28:53.123"))
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
    ParquetOutputTimestampType.TIMESTAMP_MILLIS.toString) {
  testTimestampPushdown(millisData)
}

// spark.sql.parquet.outputTimestampType = TIMESTAMP_MICROS
val microsData = Seq(Timestamp.valueOf("2018-06-14 08:28:53.123456"),
  Timestamp.valueOf("2018-06-15 08:28:53.123456"),
  Timestamp.valueOf("2018-06-16 08:28:53.123456"),
  Timestamp.valueOf("2018-06-17 08:28:53.123456"))
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
    ParquetOutputTimestampType.TIMESTAMP_MICROS.toString) {
  testTimestampPushdown(microsData)
}

We shouldn't use the same data to test the TIMESTAMP_MILLIS and TIMESTAMP_MICROS types:
- TIMESTAMP_MILLIS will truncate the trailing 456 if we use microsData for the test.
- It can't exercise DateTimeUtils.fromJavaTimestamp(t.asInstanceOf[Timestamp]) if we use millisData.
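The truncation argument can be checked outside Spark. A small Python sketch (the helper names are mine, not from the PR) shows why a `.123456` value does not survive a millisecond representation:

```python
from datetime import datetime, timezone

def to_micros(ts: datetime) -> int:
    # Microseconds since the epoch, as TIMESTAMP_MICROS would store the value.
    return int(ts.replace(tzinfo=timezone.utc).timestamp()) * 1_000_000 + ts.microsecond

def to_millis(micros: int) -> int:
    # TIMESTAMP_MILLIS keeps only whole milliseconds: integer division
    # truncates the sub-millisecond part.
    return micros // 1000

ts = datetime(2018, 6, 14, 8, 28, 53, 123456)
micros = to_micros(ts)
# Round-tripping through milliseconds loses the trailing "456" microseconds,
# which is why microsData would not round-trip through a TIMESTAMP_MILLIS write.
assert to_millis(micros) * 1000 != micros
```

Hence the test uses `.123`-precision values for TIMESTAMP_MILLIS and `.123456`-precision values for TIMESTAMP_MICROS.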
| } |
| } |
| test("filter pushdown - timestamp(TIMESTAMP_MILLIS)") { |
I think we should also test the INT96 timestamp type, and maybe also the case where PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED is disabled.
Test build #92908 has finished for PR 21741 at commit
gengliangwang
left a comment
LGTM
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 |
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz |
| Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative |
shall we add a new line after the benchmark name? e.g.
Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)):
Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
...
We can send a follow-up PR to fix this entire file.
OK. I'll send a follow-up PR.
LGTM
| buildConf("spark.sql.parquet.filterPushdown.timestamp") |
| .doc("If true, enables Parquet filter push-down optimization for Timestamp. " + |
| "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " + |
| "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.") |
Shall we note INT64 here?
I think end users have a better understanding of TIMESTAMP_MICROS and TIMESTAMP_MILLIS.
... I don't think ordinary users will understand any of them ...
You need to explain how to use spark.sql.parquet.outputTimestampType to control the Parquet timestamp type Spark uses to write Parquet files.
I would just note that push-down doesn't work for INT96 timestamps in the file. It should work for the others.
| } |
| } |
nit: I would revert this change if you are going to push more changes.
OK
HyukjinKwon
left a comment
LGTM too
+1

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

Test build #93006 has finished for PR 21741 at commit

Merged to master.
What changes were proposed in this pull request?

Timestamp support pushdown to the Parquet data source. Only TIMESTAMP_MICROS and TIMESTAMP_MILLIS support push-down.

How was this patch tested?

Unit tests and benchmark tests.
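As a hedged illustration of how the two configs in this PR interact (a plain-Python toy model, not Spark code; the key names come from the diff above, and the defaults are assumed from it), the new timestamp flag only takes effect when the parent Parquet pushdown flag is also enabled:

```python
# Toy model of the conf interplay described in the PR: timestamp pushdown
# requires both spark.sql.parquet.filterPushdown and the new
# spark.sql.parquet.filterPushdown.timestamp flag to be "true".
DEFAULTS = {
    "spark.sql.parquet.filterPushdown": "true",
    "spark.sql.parquet.filterPushdown.timestamp": "true",  # default per the diff
}

def timestamp_pushdown_active(conf: dict) -> bool:
    merged = {**DEFAULTS, **conf}
    return (merged["spark.sql.parquet.filterPushdown"] == "true"
            and merged["spark.sql.parquet.filterPushdown.timestamp"] == "true")
```

Disabling either flag (or reading a file that stores timestamps as INT96) leaves timestamp filters evaluated by Spark rather than the Parquet reader.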