[SPARK-15678][SQL] Not use cache on appends and overwrites #13419
sameeragarwal wants to merge 2 commits into apache:master from
Conversation
spark.range(1000).write.mode("overwrite").parquet(path)
val df = sqlContext.read.parquet(path).cache()
assert(df.count() == 1000)
sqlContext.range(10).write.mode("overwrite").parquet(path)
Test build #59668 has finished for PR 13419 at commit
Hi, @sameeragarwal.
@dongjoon-hyun no reason; old habits. I'll fix this. Thanks! :)
I would prefer refreshing the dataset every time it is reloaded, but keeping existing ones unchanged:
val df1 = sqlContext.read.parquet(dir).cache()
df1.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
val df2 = sqlContext.read.parquet(dir)
df2.count() // outputs 10
df1.count() // still outputs 1000 because it was cached
Neither approach is perfectly safe, so I don't have a strong preference for either.
@mengxr it seems like overwriting generates new files, so we can achieve the same semantics without introducing an additional timestamp. The current solution should respect the contract for old dataframes while making sure that the new ones don't use the cached value. Let me know what you think.
Also cc'ing @davies
Test build #59706 has finished for PR 13419 at commit
I guess that the caching is done over multiple nodes. If the data for a dataset is updated physically and some of the nodes where the data was cached go down, would the existing data still be treated as cached?
@tejasapatil if the nodes where the data was cached go down, the CacheManager should still consider that data as cached. In that case, the next time the data is accessed, the underlying RDD will be recomputed and cached again.
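As a user-level illustration of that recompute-on-access behavior (a sketch only, not Spark internals; `path` is a placeholder):

```scala
// cache() is lazy, and the cache entry survives executor loss.
val df = sqlContext.read.parquet(path).cache()
df.count()   // materializes the cached partitions on the current executors

// If an executor holding cached blocks is later lost, the entry is not dropped;
// the missing partitions are recomputed from the files under `path` on the next
// access and cached again.
df.count()
```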
I ended up creating a small design doc describing the problem and presenting two possible solutions at https://docs.google.com/document/d/1h5SzfC5UsvIrRpeLNDKSMKrKJvohkkccFlXo-GBAwQQ/edit?ts=574f717f#. Based on this, we decided in favor of option 2 (#13566) as it is a less intrusive change to the default behavior. I'm going to close this PR for now, but we may revisit this approach (i.e., option 1) for 2.1.
What changes were proposed in this pull request?
Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.
Current behavior: re-reading a path whose contents have been overwritten still returns the stale cached result.
Expected behavior: re-reading the path reflects the newly written data (see the sketch below).
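For concreteness, a sketch of the scenario (mirroring the test snippet reviewed above; `path` is a placeholder and the commented results are illustrative):

```scala
spark.range(1000).write.mode("overwrite").parquet(path)
val df = sqlContext.read.parquet(path).cache()
df.count()                             // 1000, and now cached

sqlContext.range(10).write.mode("overwrite").parquet(path)

// Current behavior: a fresh read of the same path matches the cached plan
// (only the directory name is compared), so the stale cached data is used.
sqlContext.read.parquet(path).count()  // returns 1000

// Expected behavior: the fresh read reflects the overwritten data.
sqlContext.read.parquet(path).count()  // should return 10
```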
This patch fixes the bug by modifying the ListingFileCatalog logic, which used to compare only the directory name (as opposed to the individual files) when comparing two plans; a sketch of the idea follows at the end of this description. Note that, in theory, this could cause a slight regression for a very large number of files, but I didn't notice any regression in micro-benchmarks with 1000s of files.
How was this patch tested?
Unit tests for overwrites and appends in ParquetQuerySuite.
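As for the fix, here is a minimal sketch of the idea (not the actual Spark code; FileEntry and SketchFileCatalog are made-up names for illustration): equality is based on the individual leaf files rather than only the root directory, so an overwrite, which produces new files, makes the new relation compare unequal to the cached one and the stale cache entry is no longer matched.

```scala
// Made-up types for illustration; in the PR the change lives in ListingFileCatalog.
case class FileEntry(path: String, length: Long, modificationTime: Long)

class SketchFileCatalog(val rootPath: String, val leafFiles: Seq[FileEntry]) {

  // Old behavior (roughly): two catalogs were equal if they pointed at the same
  // root directory, so a cached plan kept matching after an overwrite.
  // Sketched new behavior: compare the concrete files backing the relation.
  override def equals(other: Any): Boolean = other match {
    case that: SketchFileCatalog => this.leafFiles.toSet == that.leafFiles.toSet
    case _ => false
  }

  override def hashCode(): Int = leafFiles.toSet.hashCode()
}
```

Comparing full file listings is a little more work than comparing a directory name, which is the potential regression for relations with very many files mentioned above.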