Conversation

@sap1ens
Contributor

sap1ens commented Jun 17, 2020

What changes were proposed in this pull request?

A new spark.sql.metadataCacheTTLSeconds option that adds time-to-live (TTL) behaviour to the existing caches in FileStatusCache and SessionCatalog.

Why are the changes needed?

Currently, Spark caches the file listing for tables and requires issuing REFRESH TABLE whenever the file listing changes outside of Spark. Unfortunately, submitting REFRESH TABLE commands by hand can be very cumbersome: with frequently added files, hundreds of tables, and dozens of users querying the data (and expecting up-to-date results), manually refreshing metadata for each table is not a workable solution.

This is a pretty common use case for streaming ingestion of data, which can be done outside of Spark (with tools like Kafka Connect, etc.).

A similar feature exists in Presto: the hive.file-status-cache-expire-time configuration property.

Does this PR introduce any user-facing change?

Yes, it's controlled with the new spark.sql.metadataCacheTTLSeconds option.

When it's set to -1 (the default), the cache behaviour doesn't change, so it stays backwards-compatible.

Otherwise, you can specify a value in seconds; for example, spark.sql.metadataCacheTTLSeconds: 60 means a 1-minute cache TTL.
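The actual caches live in Spark's Scala code, but the expire-after-write semantics the option enables can be sketched with a minimal, self-contained model. This is a toy illustration only, not Spark's implementation; the class and method names below are hypothetical:

```python
import time

class TtlCacheSketch:
    """Toy model of expire-after-write caching: entries become invisible
    once they are older than the TTL, mirroring how a positive
    spark.sql.metadataCacheTTLSeconds forces a fresh file listing."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds   # <= 0 means "never expire" (the -1 default)
        self.clock = clock       # injectable so tests can fake the passage of time
        self._entries = {}       # key -> (value, write_time)

    def put(self, key, value):
        self._entries[key] = (value, self.clock())

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, written = entry
        if self.ttl > 0 and self.clock() - written > self.ttl:
            del self._entries[key]   # expired: caller must re-list files
            return None
        return value
```

With a 60-second TTL, cached file listings are served until the entry ages out, after which the caller falls back to a real listing. That is the point of the option: stale metadata disappears without anyone issuing REFRESH TABLE.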

How was this patch tested?

Added new tests in:

  • FileIndexSuite
  • SessionCatalogSuite

@maropu
Member

maropu commented Jun 18, 2020

ok to test

@maropu
Member

maropu commented Jun 18, 2020

Could you add tests?

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124182 has finished for PR 28852 at commit f03fe24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sap1ens
Contributor Author

sap1ens commented Jun 18, 2020

@maropu regarding testing, I don't see a dedicated test suite for the FileStatusCache. It's implicitly tested in HiveSchemaInferenceSuite though.

Do you think I should create a new suite for FileStatusCache? Or try to extend HiveSchemaInferenceSuite?

@SparkQA

SparkQA commented Jun 18, 2020

Test build #124224 has finished for PR 28852 at commit f409366.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 19, 2020

Do you think I should create a new suite for FileStatusCache? Or try to extend HiveSchemaInferenceSuite?

How about adding CatalogFileIndexSuite in the hive package?

@sap1ens
Contributor Author

sap1ens commented Jun 19, 2020

@maropu

Do you think I should create a new suite for FileStatusCache? Or try to extend HiveSchemaInferenceSuite?

How about adding CatalogFileIndexSuite in the hive package?

It looks like CatalogFileIndex relies on InMemoryFileIndex as well, so the scope of testing would be broader than what this change adds... I think testing FileStatusCache exclusively is probably the most straightforward thing to do, but I'm happy to invest more time in testing CatalogFileIndex too.

If we go ahead with CatalogFileIndexSuite, doesn't it make more sense to put it in the core package, where CatalogFileIndex is located?

@sap1ens force-pushed the SPARK-30616-metadata-cache-ttl branch from d9c5bf7 to 28da5cf on June 24, 2020 23:24
@SparkQA

SparkQA commented Jun 25, 2020

Test build #124502 has finished for PR 28852 at commit d9c5bf7.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 25, 2020

Test build #124504 has finished for PR 28852 at commit 28da5cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124632 has finished for PR 28852 at commit 18feeb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sap1ens requested a review from maropu on June 29, 2020 16:51
@maropu
Member

maropu commented Jun 30, 2020

Looks okay. cc: @cloud-fan @dongjoon-hyun @HyukjinKwon

@SparkQA

SparkQA commented Jun 30, 2020

Test build #124678 has finished for PR 28852 at commit 1d5248e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sap1ens
Contributor Author

sap1ens commented Jun 30, 2020

^ the test failure seems to be unrelated, I see similar failures in other branches...

@maropu
Member

maropu commented Jun 30, 2020

retest this please

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124702 has finished for PR 28852 at commit 1d5248e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 1, 2020

retest this please

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124709 has finished for PR 28852 at commit 1d5248e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sap1ens requested a review from gatorsmile on July 8, 2020 18:02
@gatorsmile
Member

if (conf.caseSensitiveAnalysis) name else name.toLowerCase(Locale.ROOT)
}

private val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
Contributor

Not related to this PR. I'm wondering how useful is this cache. The file listing is cached in another place(FileStatusCache), and seems this relation cache doesn't give many benefits. cc @viirya @maropu

Member

Ah, I see. As you suggested, the most painful part (the listed files) is already cached there. But it seems some data sources still have some processing cost when resolving a relation (e.g., JDBC data sources send a query to an external database for schema resolution), so I think we need to carefully check the performance impact of removing this cache.

LogicalRelation(dataSource.resolveRelation(checkFilesExist = false), table)

Contributor

For external data sources, it's common that data are changed outside of Spark. I think it's more important to make sure we get the latest data in a new query. Maybe we should disable this relation cache by default.

Member

Hmm, I think this cache is still useful for avoiding inferring schema again. This is also an expensive operation.
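The point about repeated schema inference can be illustrated with a small memoization sketch. Spark's tableRelationCache maps a qualified table name to a resolved plan, so expensive work such as schema inference runs once per table rather than once per query; the names below are hypothetical stand-ins, not Spark's API:

```python
import functools

# Hypothetical stand-in for an expensive resolution step, e.g. inferring a
# schema by scanning files or querying an external database over JDBC.
inference_calls = {"count": 0}

def infer_schema(table_name):
    inference_calls["count"] += 1
    return ("id", "value")  # pretend these columns were discovered

@functools.lru_cache(maxsize=None)
def resolve_relation(table_name):
    # Memoized, analogous to the relation cache: the expensive inference
    # above runs at most once per table name, however many queries touch it.
    return {"table": table_name, "schema": infer_schema(table_name)}
```

Calling resolve_relation repeatedly for the same table performs inference once; the trade-off discussed in this thread is that the cached result can go stale when the underlying source changes outside of Spark.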

Contributor

Ah, that's a good point. We should probably investigate how to design the data source API so that sources that don't need to infer schema can skip this cache. The JDBC data source is hard to use as-is, since we need to run REFRESH TABLE (or wait for the TTL after this PR) once the table is changed outside of Spark (which is common for JDBC sources).

@SparkQA

SparkQA commented Jul 17, 2020

Test build #126027 has finished for PR 28852 at commit 3e761dc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jul 17, 2020

Test build #126042 has finished for PR 28852 at commit 3e761dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan closed this in 34baed8 on Jul 17, 2020
cloud-fan pushed a commit that referenced this pull request Jul 23, 2020
### What changes were proposed in this pull request?

This is a follow-up of #28852.

This PR changes the doc to use only the config name; otherwise the doc for the config entry shows the entire details of the referring configs.

### Why are the changes needed?

The doc for the newly introduced config entry shows the entire details of the referring configs.

### Does this PR introduce _any_ user-facing change?

The doc for the config entry will show only the referring config keys.

### How was this patch tested?

Existing tests.

Closes #29194 from ueshin/issues/SPARK-30616/fup.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
"session catalog cache. This configuration only has an effect when this value having " +
"a positive value (> 0). It also requires setting " +
s"${StaticSQLConf.CATALOG_IMPLEMENTATION} to `hive`, setting " +
s"${SQLConf.HIVE_FILESOURCE_PARTITION_FILE_CACHE_SIZE} > 0 and setting " +
Member

Could you update the message by using ${SQLConf.HIVE_FILESOURCE_PARTITION_FILE_CACHE_SIZE.key}?

Contributor Author

That was done in #29194 :)
