[SPARK-15678] Add support to REFRESH data source paths #13566
sameeragarwal wants to merge 3 commits into apache:master from sameeragarwal/refresh-path-2
Conversation
Test build #60187 has finished for PR 13566 at commit
      (fs, path.makeQualified(fs.getUri, fs.getWorkingDirectory))
    }
    cachedData.foreach {
      case data if data.plan.find {
Could you move this into a separate function? It was kinda hard to understand that it is part of the case guard.
Looks pretty good. Left one comment.
Thanks, I pulled it out into a separate function.
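For readers following along, here is a simplified, hypothetical sketch of what that refactor looks like in spirit; the names `CachedData` and `planContainsPath` are placeholders, not Spark internals. The point is that the predicate that previously lived inline in the `case ... if` guard is lifted into a named helper:

```scala
// Hypothetical, simplified illustration of moving a case-guard predicate into a helper.
// `CachedData` and `planContainsPath` are placeholder names, not the actual Spark code.
case class CachedData(plan: String, paths: Seq[String])

def planContainsPath(data: CachedData, qualifiedPath: String): Boolean =
  data.paths.contains(qualifiedPath)

val cachedData = Seq(
  CachedData("parquet scan", Seq("/tmp/test")),
  CachedData("join", Seq("/tmp/other")))

cachedData.foreach {
  case data if planContainsPath(data, "/tmp/test") =>
    println(s"would invalidate and refresh: ${data.plan}")   // cache entry touches the path
  case _ => // unaffected by this path
}
```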
Test build #60212 has finished for PR 13566 at commit
     *
     * @since 2.0.0
     */
    def refreshResource(path: String): Unit
Should we call it invalidateCache() to reflect what it actually does?
Also, it's a bit confusing to have this API on Catalog; can we put it on SparkSession?
I'm confused by Catalog/SessionCatalog/ExternalCatalog here; I thought this was SessionCatalog or ExternalCatalog, so it makes sense for it to be here (together with the other cache-related APIs).
I like invalidateCache(), but the reason for choosing refreshResource() was to make it sound similar to refreshTable() above. Let me know if you prefer one over the other.
resource does sound a bit weird to me
alright, changed this to refreshByPath based on @ericl's suggestion :)
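For reference, a quick usage sketch of the renamed method, assuming an active `SparkSession` named `spark` and that the `path: String` signature from the diff above carries over unchanged:

```scala
// Usage sketch of the renamed API; assumes an existing SparkSession named `spark`.
val dir = "/tmp/test"
spark.catalog.refreshByPath(dir)   // invalidate and refresh cached plans that reference `dir`
```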
Test build #60317 has finished for PR 13566 at commit

LGTM,
## What changes were proposed in this pull request?
Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.
Current behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset
```
This patch fixes the bug by adding support for `REFRESH path`, which invalidates and refreshes all the cached data (and the associated metadata) for any DataFrame that contains the given data source path.
Expected behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshByPath(dir)
sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset
```
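Since the change is described as `REFRESH path` support, the same invalidation should also be reachable from SQL. A rough sketch follows; the exact path-quoting rules accepted by `REFRESH` are an assumption here:

```scala
// Rough sketch of the SQL form; the path-quoting rules are an assumption.
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.sql(s"REFRESH $dir")           // invalidate cached plans that reference `dir`
sqlContext.read.parquet(dir).count()      // outputs 10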
## How was this patch tested?
Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`.
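As a rough illustration (not the actual suite code), a test for the overwrite case could look like the following; the `local[2]` session and temp-directory setup are assumptions for a standalone snippet:

```scala
// Standalone sketch of an overwrite-then-refresh test; not the actual ParquetQuerySuite code.
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("refresh-by-path-test").getOrCreate()
val dir = Files.createTempDirectory("refresh-by-path").toString

spark.range(1000).write.mode("overwrite").parquet(dir)
val df = spark.read.parquet(dir).cache()
assert(df.count() == 1000)                 // cache is populated with the original data

spark.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshByPath(dir)           // drop and refresh cached entries under `dir`
assert(spark.read.parquet(dir).count() == 10)

spark.stop()
```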
Author: Sameer Agarwal <sameer@databricks.com>
Closes #13566 from sameeragarwal/refresh-path-2.
(cherry picked from commit 468da03)
Signed-off-by: Davies Liu <davies.liu@gmail.com>