[SPARK-15678] Add support to REFRESH data source paths #13566

Closed
sameeragarwal wants to merge 3 commits into apache:master from sameeragarwal:refresh-path-2

Conversation

@sameeragarwal (Member)

What changes were proposed in this pull request?

Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.

Current behavior:

```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset
```

This patch fixes this bug by adding support for `REFRESH path`, which invalidates and refreshes all the cached data (and the associated metadata) for any DataFrame that contains the given data source path.

Expected behavior:

```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshResource(dir)
sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset
```
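Since the patch also adds a SQL command for this (the first test build below reports a new `case class RefreshResource(path: String)`), the same invalidation should be reachable from SQL. A minimal sketch, assuming the path is passed as a quoted string literal:

```scala
// Hypothetical SQL form of the refresh; the exact quoting rules for the
// path literal are an assumption, not confirmed by this PR's discussion.
spark.sql("REFRESH \"/tmp/test\"")
spark.read.parquet("/tmp/test").count() // should no longer hit the stale cache
```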

How was this patch tested?

Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`.
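A minimal sketch of the kind of test described, using the `refreshByPath` name settled on later in this thread; `withTempPath` and the exact assertions are illustrative assumptions, not the patch's actual test code:

```scala
test("refreshByPath should invalidate cached data for an overwritten path") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    spark.range(1000).write.mode("overwrite").parquet(path)
    val df = spark.read.parquet(path).cache()
    assert(df.count() == 1000)

    // Overwrite the underlying files, then refresh the cached entries.
    spark.range(10).write.mode("overwrite").parquet(path)
    spark.catalog.refreshByPath(path)
    assert(spark.read.parquet(path).count() == 10)
  }
}
```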

@SparkQA

SparkQA commented Jun 8, 2016

Test build #60187 has finished for PR 13566 at commit ece34ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class RefreshResource(path: String)

```scala
(fs, path.makeQualified(fs.getUri, fs.getWorkingDirectory))
}
cachedData.foreach {
case data if data.plan.find {
```
Contributor

Could you move this into a separate function? It was kinda hard to understand that it is part of the case guard.
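A sketch of the kind of extraction being requested; the helper name and the pattern-matched types are assumptions based on the surrounding excerpt, not the exact code that was merged:

```scala
// Hypothetical helper pulled out of the case guard: returns true if the
// plan node is a file-based relation rooted at the qualified path, and
// refreshes that relation's cached file listing as a side effect.
private def lookupAndRefresh(plan: LogicalPlan, fs: FileSystem, qualifiedPath: Path): Boolean =
  plan match {
    case lr: LogicalRelation => lr.relation match {
      case hr: HadoopFsRelation =>
        val invalidate = hr.location.paths
          .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory))
          .contains(qualifiedPath)
        if (invalidate) hr.location.refresh()
        invalidate
      case _ => false
    }
    case _ => false
  }

// The case guard in cachedData.foreach then reduces to a single call:
//   case data if data.plan.find(lookupAndRefresh(_, fs, qualifiedPath)).isDefined => ...
```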

@hvanhovell (Contributor)

Looks pretty good. Left one comment.

@sameeragarwal (Member Author)

Thanks, I pulled it out in a separate function.

@SparkQA

SparkQA commented Jun 9, 2016

Test build #60212 has finished for PR 13566 at commit 6acd0c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
 *
 * @since 2.0.0
 */
def refreshResource(path: String): Unit
```
Contributor

Should we call it invalidateCache() to reflect what it actually does?

Also, it's a bit confusing to have this API on Catalog; can we put it on SparkSession?

Contributor

I'm confused by the Catalog/SessionCatalog/ExternalCatalog split here; I thought this was SessionCatalog or ExternalCatalog, so it makes sense for it to be here (together with the other cache-related APIs).

Member Author

I like invalidateCache(), but the reason for choosing refreshResource() was to make it sound similar to refreshTable() above. Let me know if you prefer one over the other.

Contributor

resource does sound a bit weird to me

Member Author

alright, changed this to refreshByPath based on @ericl's suggestion :)
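
For reference, with the rename the user-facing call in the example above would read (usage sketch):

```scala
spark.catalog.refreshByPath("/tmp/test")
```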

@SparkQA

SparkQA commented Jun 11, 2016

Test build #60317 has finished for PR 13566 at commit e79f3f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor)

davies commented Jun 11, 2016

LGTM, merging this into master and 2.0, thanks!

asfgit pushed a commit that referenced this pull request Jun 11, 2016

Author: Sameer Agarwal <sameer@databricks.com>

Closes #13566 from sameeragarwal/refresh-path-2.

(cherry picked from commit 468da03)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
asfgit closed this in 468da03 on Jun 11, 2016