Conversation

@sap1ens
Contributor

sap1ens commented Jun 17, 2020

What changes were proposed in this pull request?

A new spark.sql.metadataCacheTTLSeconds option that adds time-to-live (TTL) behaviour to the existing caches in FileStatusCache and SessionCatalog.

Why are the changes needed?

Currently, Spark caches the file listing for tables and requires issuing REFRESH TABLE whenever the file listing changes outside of Spark. Unfortunately, submitting REFRESH TABLE commands by hand can be very cumbersome: with frequently added files, hundreds of tables, and dozens of users querying the data (and expecting up-to-date results), manually refreshing metadata for each table is not a workable solution.

This is a pretty common use case for streaming ingestion of data, which can be done outside of Spark (with tools like Kafka Connect, etc.).

A similar feature exists in Presto: the hive.file-status-cache-expire-time configuration property.

Does this PR introduce any user-facing change?

Yes, it's controlled with the new spark.sql.metadataCacheTTLSeconds option.

When it's set to -1 (the default), the cache behaviour doesn't change, so it stays backwards-compatible.

Otherwise, you can specify a value in seconds; for example, spark.sql.metadataCacheTTLSeconds: 60 means a 1-minute cache TTL.
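The actual caches live in Spark's Scala code, but the expire-after-write semantics the option enables can be sketched with a minimal, self-contained model. This is a toy illustration only, not Spark's implementation; the class and method names below are hypothetical:

```python
import time

class TtlCacheSketch:
    """Toy model of expire-after-write caching: entries become invisible
    once they are older than the TTL, mirroring how a positive
    spark.sql.metadataCacheTTLSeconds forces a fresh file listing."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds   # <= 0 means "never expire" (the -1 default)
        self.clock = clock       # injectable so tests can fake the passage of time
        self._entries = {}       # key -> (value, write_time)

    def put(self, key, value):
        self._entries[key] = (value, self.clock())

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, written = entry
        if self.ttl > 0 and self.clock() - written > self.ttl:
            del self._entries[key]   # expired: caller must re-list files
            return None
        return value
```

With a 60-second TTL, cached file listings are served until the entry ages out, after which the caller falls back to a real listing. That is the point of the option: stale metadata disappears without anyone issuing REFRESH TABLE.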

How was this patch tested?

Added new tests in:

  • FileIndexSuite
  • SessionCatalogSuite

@maropu
Member

maropu commented Jun 18, 2020

ok to test

@maropu
Member

maropu commented Jun 18, 2020

Could you add tests?

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124182 has finished for PR 28852 at commit f03fe24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sap1ens
Contributor Author

sap1ens commented Jun 18, 2020

@maropu regarding testing, I don't see a dedicated test suite for the FileStatusCache. It's implicitly tested in HiveSchemaInferenceSuite though.

Do you think I should create a new suite for FileStatusCache? Or try to extend HiveSchemaInferenceSuite?

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124224 has finished for PR 28852 at commit f409366.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 19, 2020

Do you think I should create a new suite for FileStatusCache? Or try to extend HiveSchemaInferenceSuite?

How about adding CatalogFileIndexSuite in the hive package?

@sap1ens
Contributor Author

sap1ens commented Jun 19, 2020

@maropu

Do you think I should create a new suite for FileStatusCache? Or try to extend HiveSchemaInferenceSuite?

How about adding CatalogFileIndexSuite in the hive package?

It looks like CatalogFileIndex relies on InMemoryFileIndex as well, so the scope of testing would be broader than what this change adds... I think testing FileStatusCache exclusively is probably the most straightforward thing to do, but I'm happy to invest more time in testing CatalogFileIndex too.

If we go ahead with CatalogFileIndexSuite, doesn't it make more sense to put it in the core package, where CatalogFileIndex is located?

@sap1ens force-pushed the SPARK-30616-metadata-cache-ttl branch from d9c5bf7 to 28da5cf on June 24, 2020 23:24
@SparkQA

SparkQA commented Jun 25, 2020

Test build #124502 has finished for PR 28852 at commit d9c5bf7.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 25, 2020

Test build #124504 has finished for PR 28852 at commit 28da5cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124632 has finished for PR 28852 at commit 18feeb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sap1ens requested a review from maropu on June 29, 2020 16:51
@maropu
Member

maropu commented Jun 30, 2020

Looks okay. cc: @cloud-fan @dongjoon-hyun @HyukjinKwon

@SparkQA

SparkQA commented Jun 30, 2020

Test build #124678 has finished for PR 28852 at commit 1d5248e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sap1ens
Contributor Author

sap1ens commented Jun 30, 2020

^ the test failure seems to be unrelated, I see similar failures in other branches...

@maropu
Member

maropu commented Jun 30, 2020

retest this please

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124702 has finished for PR 28852 at commit 1d5248e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 1, 2020

retest this please

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124709 has finished for PR 28852 at commit 1d5248e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sap1ens requested a review from gatorsmile on July 8, 2020 18:02
@gatorsmile
Member

if (conf.caseSensitiveAnalysis) name else name.toLowerCase(Locale.ROOT)
}

private val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
Contributor

Not related to this PR. I'm wondering how useful is this cache. The file listing is cached in another place(FileStatusCache), and seems this relation cache doesn't give many benefits. cc @viirya @maropu

Member

Ah, I see. As you suggested, the most painful part (the listed files) is already cached there. But it seems some data sources still have some processing cost when resolving a relation (e.g., JDBC data sources send a query to an external database for schema resolution), so I think we need to carefully check the performance impact of removing this cache.

LogicalRelation(dataSource.resolveRelation(checkFilesExist = false), table)

Contributor

For external data sources, it's common that data are changed outside of Spark. I think it's more important to make sure we get the latest data in a new query. Maybe we should disable this relation cache by default.

Member

Hmm, I think this cache is still useful for avoiding inferring schema again. This is also an expensive operation.
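The point about repeated schema inference can be illustrated with a small memoization sketch. Spark's tableRelationCache maps a qualified table name to a resolved plan, so expensive work such as schema inference runs once per table rather than once per query; the names below are hypothetical stand-ins, not Spark's API:

```python
import functools

# Hypothetical stand-in for an expensive resolution step, e.g. inferring a
# schema by scanning files or querying an external database over JDBC.
inference_calls = {"count": 0}

def infer_schema(table_name):
    inference_calls["count"] += 1
    return ("id", "value")  # pretend these columns were discovered

@functools.lru_cache(maxsize=None)
def resolve_relation(table_name):
    # Memoized, analogous to the relation cache: the expensive inference
    # above runs at most once per table name, however many queries touch it.
    return {"table": table_name, "schema": infer_schema(table_name)}
```

Calling resolve_relation repeatedly for the same table performs inference once; the trade-off discussed in this thread is that the cached result can go stale when the underlying source changes outside of Spark.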

Contributor

Ah, that's a good point. We should probably investigate how to design the data source API so that sources that don't need to infer schema can skip this cache. The JDBC data source is hard to use as-is, since we need to run REFRESH TABLE (or wait for the TTL after this PR) once the table is changed outside of Spark (which is common for JDBC sources).

@SparkQA

SparkQA commented Jul 17, 2020

Test build #126027 has finished for PR 28852 at commit 3e761dc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jul 17, 2020

Test build #126042 has finished for PR 28852 at commit 3e761dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan closed this in 34baed8 on Jul 17, 2020
cloud-fan pushed a commit that referenced this pull request Jul 23, 2020
### What changes were proposed in this pull request?

This is a follow-up of #28852.

This PR changes the doc to use only the config name; otherwise the doc for the config entry shows the entire details of the referring configs.

### Why are the changes needed?

The doc for the newly introduced config entry shows the entire details of the referring configs.

### Does this PR introduce _any_ user-facing change?

The doc for the config entry will show only the referring config keys.

### How was this patch tested?

Existing tests.

Closes #29194 from ueshin/issues/SPARK-30616/fup.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
"session catalog cache. This configuration only has an effect when this value having " +
"a positive value (> 0). It also requires setting " +
s"${StaticSQLConf.CATALOG_IMPLEMENTATION} to `hive`, setting " +
s"${SQLConf.HIVE_FILESOURCE_PARTITION_FILE_CACHE_SIZE} > 0 and setting " +
Member

Could you update the message by using ${SQLConf.HIVE_FILESOURCE_PARTITION_FILE_CACHE_SIZE.key}?

Contributor Author

That was done in #29194 :)
