[SPARK-24718][SQL] Timestamp support pushdown to parquet data source #21741
Changes from all commits
b2a9000
ff31610
5471d79
f206457
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -378,6 +378,15 @@ object SQLConf { | |
| .booleanConf | ||
| .createWithDefault(true) | ||
|
|
||
| val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED = | ||
| buildConf("spark.sql.parquet.filterPushdown.timestamp") | ||
| .doc("If true, enables Parquet filter push-down optimization for Timestamp. " + | ||
| "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " + | ||
| "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.") | ||
| .internal() | ||
| .booleanConf | ||
| .createWithDefault(true) | ||
|
Member
Author
Maybe the default should be
Contributor
Because we're using the file schema, it doesn't matter what the write configuration is. It only matters what it was when the file was written. If the file has an INT96 timestamp, this should just not push anything down. |
||
|
|
||
| val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED = | ||
| buildConf("spark.sql.parquet.filterPushdown.string.startsWith") | ||
| .doc("If true, enables Parquet filter push-down optimization for string startsWith function. " + | ||
|
|
@@ -1494,6 +1503,8 @@ class SQLConf extends Serializable with Logging { | |
|
|
||
| def parquetFilterPushDownDate: Boolean = getConf(PARQUET_FILTER_PUSHDOWN_DATE_ENABLED) | ||
|
|
||
| def parquetFilterPushDownTimestamp: Boolean = getConf(PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED) | ||
|
|
||
| def parquetFilterPushDownStringStartWith: Boolean = | ||
| getConf(PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED) | ||
|
|
||
|
|
||
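The gating rule described in the config doc and the review comment above can be sketched roughly as follows. This is a minimal model, not the code in this PR: `ParquetTimestampType`, `canPushDown`, and the object name are hypothetical names introduced for illustration. The only facts taken from the thread are that the timestamp flag matters only when `spark.sql.parquet.filterPushdown` is enabled, and that the decision follows the type found in the file schema, with INT96 pushing nothing down.

```scala
object TimestampPushdownSketch {
  // Hypothetical model of the timestamp representations a Parquet file may contain.
  sealed trait ParquetTimestampType
  case object Int96 extends ParquetTimestampType
  case object TimestampMicros extends ParquetTimestampType
  case object TimestampMillis extends ParquetTimestampType

  // filterPushdown     ~ spark.sql.parquet.filterPushdown
  // timestampPushdown  ~ spark.sql.parquet.filterPushdown.timestamp
  // fileType           ~ the type recorded in the file schema, not the current write config
  def canPushDown(
      filterPushdown: Boolean,
      timestampPushdown: Boolean,
      fileType: ParquetTimestampType): Boolean =
    filterPushdown && timestampPushdown && (fileType match {
      case Int96                             => false // nothing is pushed down for INT96
      case TimestampMicros | TimestampMillis => true
    })
}
```

Under this model a file written as INT96 is never eligible, regardless of what the session's write configuration is at read time.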
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -578,3 +578,127 @@ Native ORC Vectorized 11622 / 12196 1.4 7 | |
| Native ORC Vectorized (Pushdown) 11377 / 11654 1.4 723.3 1.0X | ||
|
|
||
|
|
||
| ================================================================================================ | ||
| Pushdown benchmark for Timestamp | ||
| ================================================================================================ | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
|
Contributor
Shall we add a new line after the benchmark name? e.g. We can send a follow-up PR to fix this entire file.
Member
Author
OK. I'll send a follow-up PR. |
||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 4784 / 4956 3.3 304.2 1.0X | ||
| Parquet Vectorized (Pushdown) 4838 / 4917 3.3 307.6 1.0X | ||
| Native ORC Vectorized 3923 / 4173 4.0 249.4 1.2X | ||
| Native ORC Vectorized (Pushdown) 894 / 943 17.6 56.8 5.4X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 10% timestamp stored as INT96 rows (value < CAST(1572864 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 5686 / 5901 2.8 361.5 1.0X | ||
| Parquet Vectorized (Pushdown) 5555 / 5895 2.8 353.2 1.0X | ||
| Native ORC Vectorized 4844 / 4957 3.2 308.0 1.2X | ||
| Native ORC Vectorized (Pushdown) 2141 / 2230 7.3 136.1 2.7X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 50% timestamp stored as INT96 rows (value < CAST(7864320 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 9100 / 9421 1.7 578.6 1.0X | ||
| Parquet Vectorized (Pushdown) 9122 / 9496 1.7 580.0 1.0X | ||
| Native ORC Vectorized 8365 / 8874 1.9 531.9 1.1X | ||
| Native ORC Vectorized (Pushdown) 7128 / 7376 2.2 453.2 1.3X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 90% timestamp stored as INT96 rows (value < CAST(14155776 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 12764 / 13120 1.2 811.5 1.0X | ||
| Parquet Vectorized (Pushdown) 12656 / 13003 1.2 804.7 1.0X | ||
| Native ORC Vectorized 13096 / 13233 1.2 832.6 1.0X | ||
| Native ORC Vectorized (Pushdown) 12710 / 15611 1.2 808.1 1.0X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 1 timestamp stored as TIMESTAMP_MICROS row (value = CAST(7864320 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 4381 / 4796 3.6 278.5 1.0X | ||
| Parquet Vectorized (Pushdown) 122 / 137 129.3 7.7 36.0X | ||
| Native ORC Vectorized 3913 / 3988 4.0 248.8 1.1X | ||
| Native ORC Vectorized (Pushdown) 905 / 945 17.4 57.6 4.8X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 10% timestamp stored as TIMESTAMP_MICROS rows (value < CAST(1572864 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 5145 / 5184 3.1 327.1 1.0X | ||
| Parquet Vectorized (Pushdown) 1426 / 1519 11.0 90.7 3.6X | ||
| Native ORC Vectorized 4827 / 4901 3.3 306.9 1.1X | ||
| Native ORC Vectorized (Pushdown) 2133 / 2210 7.4 135.6 2.4X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 50% timestamp stored as TIMESTAMP_MICROS rows (value < CAST(7864320 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 9234 / 9516 1.7 587.1 1.0X | ||
| Parquet Vectorized (Pushdown) 6752 / 7046 2.3 429.3 1.4X | ||
| Native ORC Vectorized 8418 / 8998 1.9 535.2 1.1X | ||
| Native ORC Vectorized (Pushdown) 7199 / 7314 2.2 457.7 1.3X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 90% timestamp stored as TIMESTAMP_MICROS rows (value < CAST(14155776 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 12414 / 12458 1.3 789.2 1.0X | ||
| Parquet Vectorized (Pushdown) 12094 / 12249 1.3 768.9 1.0X | ||
| Native ORC Vectorized 12198 / 13755 1.3 775.5 1.0X | ||
| Native ORC Vectorized (Pushdown) 12205 / 12431 1.3 776.0 1.0X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 1 timestamp stored as TIMESTAMP_MILLIS row (value = CAST(7864320 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 4369 / 4515 3.6 277.8 1.0X | ||
| Parquet Vectorized (Pushdown) 116 / 125 136.2 7.3 37.8X | ||
| Native ORC Vectorized 3965 / 4703 4.0 252.1 1.1X | ||
| Native ORC Vectorized (Pushdown) 892 / 1162 17.6 56.7 4.9X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 10% timestamp stored as TIMESTAMP_MILLIS rows (value < CAST(1572864 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 5211 / 5409 3.0 331.3 1.0X | ||
| Parquet Vectorized (Pushdown) 1427 / 1438 11.0 90.7 3.7X | ||
| Native ORC Vectorized 4719 / 4883 3.3 300.1 1.1X | ||
| Native ORC Vectorized (Pushdown) 2191 / 2228 7.2 139.3 2.4X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 50% timestamp stored as TIMESTAMP_MILLIS rows (value < CAST(7864320 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 8716 / 8953 1.8 554.2 1.0X | ||
| Parquet Vectorized (Pushdown) 6632 / 6968 2.4 421.7 1.3X | ||
| Native ORC Vectorized 8376 / 9118 1.9 532.5 1.0X | ||
| Native ORC Vectorized (Pushdown) 7218 / 7609 2.2 458.9 1.2X | ||
|
|
||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6 | ||
| Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz | ||
|
|
||
| Select 90% timestamp stored as TIMESTAMP_MILLIS rows (value < CAST(14155776 AS timestamp)): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------ | ||
| Parquet Vectorized 12264 / 12452 1.3 779.7 1.0X | ||
| Parquet Vectorized (Pushdown) 11766 / 11927 1.3 748.0 1.0X | ||
| Native ORC Vectorized 12101 / 12301 1.3 769.3 1.0X | ||
| Native ORC Vectorized (Pushdown) 11983 / 12651 1.3 761.9 1.0X | ||
|
|
||
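For readers of the benchmark tables above, the derived columns can be approximately reproduced from the best times. This is a sketch under assumptions: the row count of 15 * 1024 * 1024 is inferred from the 50% predicate value 7864320 and is not stated in the output, and `BenchmarkColumns` and its methods are hypothetical helpers, not part of Spark's benchmark code. The results agree with the printed figures only to within rounding, since the tables show times rounded to milliseconds.

```scala
object BenchmarkColumns {
  // Assumed row count: 7864320 (the 50% threshold) is half of 15 * 1024 * 1024.
  val numRows: Long = 15L * 1024 * 1024

  // "Relative": speedup of a case over the first (baseline) case of its table.
  def relative(baselineBestMs: Double, bestMs: Double): Double =
    baselineBestMs / bestMs

  // "Per Row(ns)": best time spread over all rows, in nanoseconds.
  def perRowNs(bestMs: Double): Double =
    bestMs * 1e6 / numRows

  // "Rate(M/s)": millions of rows processed per second at the best time.
  def rateMillionsPerSec(bestMs: Double): Double =
    numRows / (bestMs / 1000.0) / 1e6
}
```

For example, `BenchmarkColumns.relative(4784, 894)` gives about 5.35 for the INT96 "Native ORC Vectorized (Pushdown)" row, which the table prints as 5.4X.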
Shall we note `INT64` here?

I think end users have a better understanding of `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS`.

... I don't think ordinary users will understand any of them ..

You need to explain how to use `spark.sql.parquet.outputTimestampType` to control the Parquet timestamp type Spark uses to write Parquet files.

I would just note that push-down doesn't work for INT96 timestamps in the file. It should work for the others.
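A possible illustration of the two points above, for whoever writes the user-facing note: the sketch below sets `spark.sql.parquet.outputTimestampType` before writing so the file stores timestamps as `TIMESTAMP_MICROS` rather than INT96, then reads it back with a timestamp predicate. The path, app name, row count, and predicate value are placeholders chosen for the example, and it assumes a local Spark build that includes this change.

```scala
import org.apache.spark.sql.SparkSession

object TimestampPushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("timestamp-pushdown-sketch") // placeholder name
      .master("local[*]")
      .getOrCreate()

    // Hypothetical output location; any writable path works.
    val path = "/tmp/timestamp_pushdown_sketch"

    // Controls the Parquet type Spark uses when WRITING timestamps.
    // Accepted values include INT96, TIMESTAMP_MICROS and TIMESTAMP_MILLIS.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    spark.range(1024)
      .selectExpr("CAST(id AS timestamp) AS value")
      .write.mode("overwrite").parquet(path)

    // Both flags must be on for timestamp predicates to be pushed down.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    spark.conf.set("spark.sql.parquet.filterPushdown.timestamp", "true")

    // The plan shows the filters handed to the Parquet source.
    spark.read.parquet(path)
      .filter("value = CAST(512 AS timestamp)")
      .explain()

    spark.stop()
  }
}
```

Changing the write setting afterwards does not affect files already on disk: a file written with INT96 timestamps stays ineligible for push-down when it is read later.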