Closed
26 commits
45f4827
support orc file meta cache
LuciferYang Aug 16, 2021
a74c793
remove ForTailCacheReader
LuciferYang Aug 16, 2021
1a9bd3b
rename test case
LuciferYang Aug 16, 2021
30df269
remove private[sql] and add comments
LuciferYang Aug 16, 2021
42d2bfd
Reduce method encapsulation
LuciferYang Aug 16, 2021
0e6c52b
use PrivateMethodTester
LuciferYang Aug 16, 2021
95bae3c
add a configable maximumSize
LuciferYang Aug 16, 2021
ebb7e0b
rename config
LuciferYang Aug 17, 2021
c36a569
move test
LuciferYang Aug 17, 2021
b02de85
update benchmark result
LuciferYang Aug 17, 2021
406f91c
revert config name
LuciferYang Aug 18, 2021
3b15a82
add compile same type
LuciferYang Aug 18, 2021
2b99983
update mirco bench
LuciferYang Aug 18, 2021
82ddf4f
change the default value of ttlSinceLastAccess
LuciferYang Aug 18, 2021
1dd174e
change the default value of ttlSinceLastAccess
LuciferYang Aug 18, 2021
a339b1b
update conf doc to add Warning
LuciferYang Aug 18, 2021
7153d2a
use a list config
LuciferYang Aug 18, 2021
c3838e6
Add checkValue to spark.sql.fileMetaCache.enabledSourceList and test …
LuciferYang Aug 19, 2021
59d5bb9
change to use guava cache and update benchmark
LuciferYang Aug 19, 2021
4adeb62
rename test case
LuciferYang Aug 19, 2021
e5f9497
add SEC to ttl
LuciferYang Aug 19, 2021
ec8fa1c
Revert "change to use guava cache and update benchmark"
LuciferYang Aug 19, 2021
db90daf
Revert "Revert "change to use guava cache and update benchmark""
LuciferYang Aug 22, 2021
7327fdb
Merge branch 'upmaster' into SPARK-36516
LuciferYang Aug 22, 2021
2907b2c
Merge branch 'master' of github.com:apache/spark into SPARK-36516
LuciferYang Aug 23, 2021
a7eff43
Merge branch 'upmaster' into SPARK-36516
LuciferYang Sep 14, 2021
@@ -963,6 +963,36 @@ object SQLConf {
.booleanConf
.createWithDefault(false)

val FILE_META_CACHE_ENABLED_SOURCE_LIST = buildConf("spark.sql.fileMetaCache.enabledSourceList")
.doc("A comma-separated list of data source short names for which the file meta cache is " +
"enabled. Currently the file meta cache only supports ORC. It is recommended to enable this " +
"config when multiple queries are performed on the same dataset. The default is an empty " +
"list, which disables the cache. " +
"Warning: if the fileMetaCache is enabled, existing data files should not be " +
"replaced with files of the same name, otherwise there is a risk of job failure or wrong " +
"data reading before the cache entry expires.")
.version("3.3.0")
.stringConf
Member

Could you add .checkValue? The valid value is only orc in this PR.
After merging this PR, you can extend it to parquet inside Parquet PR.

Contributor Author

c3838e6 add .checkValue and test case

.checkValue(value => {
// Accept an empty value (cache disabled) or exactly one entry equal to "orc".
val valueList = value.toLowerCase(Locale.ROOT).split(",").map(_.trim)
value.trim.isEmpty || (valueList.length == 1 && valueList.contains("orc"))
}, "spark.sql.fileMetaCache.enabledSourceList only supports orc now")
.createWithDefault("")
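
A minimal sketch of what the .checkValue predicate above accepts, lifted into a standalone function so it can be tried without Spark. The names EnabledSourceListCheck and isValid are hypothetical and only for illustration; the body follows the predicate shown in the diff.

```scala
import java.util.Locale

// Hypothetical wrapper, not part of the PR: it isolates the validation
// predicate so its behaviour can be checked on plain strings.
object EnabledSourceListCheck {
  def isValid(value: String): Boolean = {
    val valueList = value.toLowerCase(Locale.ROOT).split(",").map(_.trim)
    // Empty means "cache disabled"; otherwise exactly one entry, "orc".
    value.trim.isEmpty || (valueList.length == 1 && valueList.contains("orc"))
  }
}
```

Under this predicate an empty value and " ORC " pass, while "parquet" and "orc,parquet" are rejected, which matches the reviewer's request that only orc be valid in this PR.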

val FILE_META_CACHE_TTL_SINCE_LAST_ACCESS_SEC =
buildConf("spark.sql.fileMetaCache.ttlSinceLastAccessSec")
.version("3.3.0")
.doc("Time-to-live for a file metadata cache entry after its last access, in seconds.")
.timeConf(TimeUnit.SECONDS)
.createWithDefault(600L)

val FILE_META_CACHE_MAXIMUM_SIZE =
buildConf("spark.sql.fileMetaCache.maximumSize")
.version("3.3.0")
.doc("Maximum number of file meta entries the file meta cache contains.")
.intConf
.checkValue(_ > 0, "The value of fileMetaCache maximumSize must be positive")
.createWithDefault(1000)
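
The two settings above bound the cache by idle time and by entry count. The commit list indicates that the PR itself uses a Guava cache; the sketch below is not that implementation. It is a stdlib-only stand-in built on java.util.LinkedHashMap in access order, assuming expire-after-last-access and least-recently-used eviction semantics. TtlLruCache, its parameters, and the injectable nowSec clock are hypothetical names introduced for illustration.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Hypothetical stand-in for illustration only; the PR uses a Guava cache.
final class TtlLruCache[K, V](
    maximumSize: Int,
    ttlSinceLastAccessSec: Long,
    nowSec: () => Long = () => System.nanoTime() / 1000000000L) {
  require(maximumSize > 0, "maximumSize must be positive")

  // accessOrder = true keeps the least recently used entry first, so
  // removeEldestEntry evicts it once the size bound is exceeded.
  // Each value is stored with the second at which it was last accessed.
  private val entries = new JLinkedHashMap[K, (V, Long)](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[K, (V, Long)]): Boolean =
      size() > maximumSize
  }

  def put(key: K, value: V): Unit = synchronized {
    entries.put(key, (value, nowSec()))
    ()
  }

  def get(key: K): Option[V] = synchronized {
    val now = nowSec()
    Option(entries.get(key)) match {
      case Some((value, lastAccess)) if now - lastAccess < ttlSinceLastAccessSec =>
        entries.put(key, (value, now)) // a hit refreshes the time-to-live
        Some(value)
      case Some(_) =>
        entries.remove(key) // idle for longer than the TTL
        None
      case None => None
    }
  }

  def size: Int = synchronized(entries.size())
}
```

With the defaults from the diff this would be constructed as new TtlLruCache[String, AnyRef](1000, 600L). An entry that keeps being read never expires, which is why the warning on enabledSourceList about replacing files under the same name matters.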

val HIVE_VERIFY_PARTITION_PATH = buildConf("spark.sql.hive.verifyPartitionPath")
.doc("When true, check all the partition paths under the table\'s root directory " +
"when reading data stored in HDFS. This configuration will be deprecated in the future " +
@@ -3621,6 +3651,12 @@ class SQLConf extends Serializable with Logging {

def parquetVectorizedReaderBatchSize: Int = getConf(PARQUET_VECTORIZED_READER_BATCH_SIZE)

def fileMetaCacheEnabled(ds: String): Boolean = {
val enabledList = getConf(FILE_META_CACHE_ENABLED_SOURCE_LIST).toLowerCase(Locale.ROOT)
.split(",").map(_.trim)
enabledList.contains(ds.toLowerCase(Locale.ROOT))
}

def columnBatchSize: Int = getConf(COLUMN_BATCH_SIZE)

def cacheVectorizedReaderEnabled: Boolean = getConf(CACHE_VECTORIZED_READER_ENABLED)
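
A small sketch of how the fileMetaCacheEnabled lookup in this hunk behaves, with the config value passed in explicitly instead of read through getConf. FileMetaCacheLookup and enabledFor are hypothetical names; the body mirrors the method in the diff.

```scala
import java.util.Locale

// Hypothetical wrapper, not part of the PR: the same lookup logic with the
// config value supplied as a parameter so it runs without a SQLConf.
object FileMetaCacheLookup {
  def enabledFor(confValue: String, ds: String): Boolean = {
    val enabledList = confValue.toLowerCase(Locale.ROOT).split(",").map(_.trim)
    // Both sides are lower-cased, so the match is case-insensitive.
    enabledList.contains(ds.toLowerCase(Locale.ROOT))
  }
}
```

With the default empty value, split yields a single empty string, so no data source matches and the cache stays off. A value of " Orc " enables the cache for both "orc" and "ORC".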
95 changes: 95 additions & 0 deletions sql/core/benchmarks/FileMetaCacheReadBenchmark-jdk11-results.txt
@@ -0,0 +1,95 @@
================================================================================================
count(*) From 100 files
================================================================================================

OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 10 columns with 100 files:              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          217           225          5       24.1         41.5      1.0X
count(*): fileMetaCacheEnabled = true                           153           156          2       34.3         29.1      1.4X
count(*) with Filter: fileMetaCacheEnabled = false              436           444          7       12.0         83.1      0.5X
count(*) with Filter: fileMetaCacheEnabled = true               377           379          2       13.9         72.0      0.6X

OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 50 columns with 100 files:              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          221           239         16       23.7         42.2      1.0X
count(*): fileMetaCacheEnabled = true                           173           183          8       30.2         33.1      1.3X
count(*) with Filter: fileMetaCacheEnabled = false              494           496          2       10.6         94.3      0.4X
count(*) with Filter: fileMetaCacheEnabled = true               431           433          2       12.2         82.2      0.5X

OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 100 columns with 100 files:             Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          287           289          2       18.3         54.8      1.0X
count(*): fileMetaCacheEnabled = true                           221           228          6       23.8         42.1      1.3X
count(*) with Filter: fileMetaCacheEnabled = false              553           555          2        9.5        105.4      0.5X
count(*) with Filter: fileMetaCacheEnabled = true               504           506          2       10.4         96.1      0.6X


================================================================================================
count(*) From 500 files
================================================================================================

OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 10 columns with 500 files:              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          772           814         72        6.8        147.3      1.0X
count(*): fileMetaCacheEnabled = true                           534           537          2        9.8        101.9      1.4X
count(*) with Filter: fileMetaCacheEnabled = false             1341          1343          3        3.9        255.8      0.6X
count(*) with Filter: fileMetaCacheEnabled = true              1115          1116          1        4.7        212.7      0.7X

OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 50 columns with 500 files:              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          793           881        117        6.6        151.3      1.0X
count(*): fileMetaCacheEnabled = true                           564           569          4        9.3        107.6      1.4X
count(*) with Filter: fileMetaCacheEnabled = false             1473          1475          3        3.6        281.0      0.5X
count(*) with Filter: fileMetaCacheEnabled = true              1253          1254          1        4.2        238.9      0.6X

OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 100 columns with 500 files:             Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          862           902         45        6.1        164.4      1.0X
count(*): fileMetaCacheEnabled = true                           623           631          9        8.4        118.9      1.4X
count(*) with Filter: fileMetaCacheEnabled = false             1695          1698          4        3.1        323.3      0.5X
count(*) with Filter: fileMetaCacheEnabled = true              1437          1445         11        3.6        274.1      0.6X


================================================================================================
count(*) From 1000 files
================================================================================================

OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 10 columns with 1000 files:             Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                         1459          1501         59        3.6        278.3      1.0X
count(*): fileMetaCacheEnabled = true                          1091          1092          1        4.8        208.0      1.3X
count(*) with Filter: fileMetaCacheEnabled = false             2518          2520          3        2.1        480.2      0.6X
count(*) with Filter: fileMetaCacheEnabled = true              2122          2130         11        2.5        404.7      0.7X

OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 50 columns with 1000 files:             Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                         1505          1506          1        3.5        287.0      1.0X
count(*): fileMetaCacheEnabled = true                          1138          1138          1        4.6        217.1      1.3X
count(*) with Filter: fileMetaCacheEnabled = false             2787          2798         16        1.9        531.5      0.5X
count(*) with Filter: fileMetaCacheEnabled = true              2405          2405          1        2.2        458.7      0.6X

OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 100 columns with 1000 files:            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                         1610          1610          1        3.3        307.0      1.0X
count(*): fileMetaCacheEnabled = true                          1299          1308         13        4.0        247.7      1.2X
count(*) with Filter: fileMetaCacheEnabled = false             3121          3123          3        1.7        595.4      0.5X
count(*) with Filter: fileMetaCacheEnabled = true              2828          2828          1        1.9        539.3      0.6X

95 changes: 95 additions & 0 deletions sql/core/benchmarks/FileMetaCacheReadBenchmark-results.txt
@@ -0,0 +1,95 @@
================================================================================================
count(*) From 100 files
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 10 columns with 100 files:              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          190           196          8       27.6         36.2      1.0X
count(*): fileMetaCacheEnabled = true                           134           138          5       39.2         25.5      1.4X
count(*) with Filter: fileMetaCacheEnabled = false              377           384          8       13.9         72.0      0.5X
count(*) with Filter: fileMetaCacheEnabled = true               328           333          6       16.0         62.6      0.6X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 50 columns with 100 files:              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          187           192          8       28.0         35.7      1.0X
count(*): fileMetaCacheEnabled = true                           146           150          6       35.9         27.9      1.3X
count(*) with Filter: fileMetaCacheEnabled = false              396           400          7       13.2         75.5      0.5X
count(*) with Filter: fileMetaCacheEnabled = true               351           355          5       14.9         67.0      0.5X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 100 columns with 100 files:             Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          237           241          6       22.1         45.2      1.0X
count(*): fileMetaCacheEnabled = true                           192           197          6       27.3         36.6      1.2X
count(*) with Filter: fileMetaCacheEnabled = false              465           471          8       11.3         88.8      0.5X
count(*) with Filter: fileMetaCacheEnabled = true               422           426          7       12.4         80.5      0.6X


================================================================================================
count(*) From 500 files
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 10 columns with 500 files:              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          647           656          6        8.1        123.4      1.0X
count(*): fileMetaCacheEnabled = true                           431           437          7       12.2         82.3      1.5X
count(*) with Filter: fileMetaCacheEnabled = false             1157          1160          5        4.5        220.7      0.6X
count(*) with Filter: fileMetaCacheEnabled = true               934           947         11        5.6        178.2      0.7X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 50 columns with 500 files:              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          673           684          9        7.8        128.5      1.0X
count(*): fileMetaCacheEnabled = true                           461           468          9       11.4         87.9      1.5X
count(*) with Filter: fileMetaCacheEnabled = false             1277          1280          5        4.1        243.5      0.5X
count(*) with Filter: fileMetaCacheEnabled = true              1052          1066         20        5.0        200.6      0.6X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 100 columns with 500 files:             Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                          720           726         11        7.3        137.3      1.0X
count(*): fileMetaCacheEnabled = true                           503           509         10       10.4         96.0      1.4X
count(*) with Filter: fileMetaCacheEnabled = false             1468          1469          1        3.6        280.0      0.5X
count(*) with Filter: fileMetaCacheEnabled = true              1232          1238          9        4.3        234.9      0.6X


================================================================================================
count(*) From 1000 files
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 10 columns with 1000 files:             Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                         1239          1245          9        4.2        236.3      1.0X
count(*): fileMetaCacheEnabled = true                           995           996          2        5.3        189.7      1.2X
count(*) with Filter: fileMetaCacheEnabled = false             2161          2169         12        2.4        412.1      0.6X
count(*) with Filter: fileMetaCacheEnabled = true              1864          1865          1        2.8        355.5      0.7X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 50 columns with 1000 files:             Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                         1292          1294          3        4.1        246.5      1.0X
count(*): fileMetaCacheEnabled = true                          1086          1097         16        4.8        207.2      1.2X
count(*) with Filter: fileMetaCacheEnabled = false             2388          2396         12        2.2        455.4      0.5X
count(*) with Filter: fileMetaCacheEnabled = true              2176          2177          0        2.4        415.1      0.6X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.14.0_1-0-0-42
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
count(*) from 100 columns with 1000 files:            Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------------
count(*): fileMetaCacheEnabled = false                         1371          1372          2        3.8        261.5      1.0X
count(*): fileMetaCacheEnabled = true                          1084          1096         17        4.8        206.7      1.3X
count(*) with Filter: fileMetaCacheEnabled = false             2698          2708         13        1.9        514.7      0.5X
count(*) with Filter: fileMetaCacheEnabled = true              2408          2408          0        2.2        459.2      0.6X
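
The Relative column in each table appears to be the first row's best time divided by that row's best time. The "with Filter" rows are therefore measured against the unfiltered, cache-disabled baseline, not against their own cache-disabled run. The sketch below recomputes the ratios from the best times in the last table (100 columns, 1000 files, Java 8) and also compares the two filter rows directly. BenchmarkRatios, speedup, and relative are hypothetical names introduced for illustration.

```scala
import java.util.Locale

// Hypothetical helper, not part of the PR or of Spark's benchmark harness.
object BenchmarkRatios {
  // How many times faster the candidate is than the baseline (best times in ms).
  def speedup(baselineMs: Double, candidateMs: Double): Double =
    baselineMs / candidateMs

  // One-decimal rendering in the style of the Relative column.
  def relative(baselineMs: Double, candidateMs: Double): String =
    "%.1fX".formatLocal(Locale.ROOT, speedup(baselineMs, candidateMs))
}

// Against the first row (cache disabled, no filter, 1371 ms):
BenchmarkRatios.relative(1371, 1084) // "1.3X", cache enabled
BenchmarkRatios.relative(1371, 2698) // "0.5X", filter, cache disabled
BenchmarkRatios.relative(1371, 2408) // "0.6X", filter, cache enabled

// Filter rows compared with each other:
BenchmarkRatios.relative(2698, 2408) // "1.1X"
```

Read this way, enabling the cache helps the filtered query too, by roughly 1.1X in this table, even though its Relative value stays below 1.0X.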
