
[SPARK-15678][SQL] Not use cache on appends and overwrites #13419

Closed
sameeragarwal wants to merge 2 commits into apache:master from sameeragarwal:drop-cache-on-write


Conversation

@sameeragarwal (Member) commented May 31, 2016

What changes were proposed in this pull request?

Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.

Current behavior:

val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset

Expected behavior:

val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
df.count() // outputs 1000
sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset

This patch fixes the bug by modifying the ListingFileCatalog logic, which previously compared only the directory names (rather than the individual files) when comparing two plans. Note that, in theory, this could cause a slight regression for a very large number of files, but I didn't observe any regression in micro-benchmarks with thousands of files.
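
For illustration, here is a minimal sketch of the idea behind the fix (not the actual Spark source; the class name and fields are simplified stand-ins): relation identity is derived from every leaf file's path, size, and modification time rather than from the root directory alone, so an overwrite that rewrites the files produces a plan that no longer matches the cached one.

import org.apache.hadoop.fs.FileStatus

// Simplified stand-in for a file catalog; not the real ListingFileCatalog API.
class SketchFileCatalog(val rootPaths: Seq[String], val leafFiles: Seq[FileStatus]) {

  // Identity is derived from each leaf file, not just the directory name.
  private def fileKeys: Seq[(String, Long, Long)] =
    leafFiles.map(f => (f.getPath.toString, f.getLen, f.getModificationTime))

  override def equals(other: Any): Boolean = other match {
    case that: SketchFileCatalog => this.fileKeys == that.fileKeys
    case _ => false
  }

  override def hashCode(): Int = fileKeys.hashCode()
}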

How was this patch tested?

Unit tests for overwrites and appends in ParquetQuerySuite.
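
Such a test might look roughly like the following sketch. It is illustrative only: spark, test, and withTempPath are assumed to come from the suite's usual test fixtures (e.g., SQLTestUtils), and the assertions simply mirror the expected behavior shown above.

test("overwrite should invalidate reads of the new data (sketch)") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    spark.range(1000).write.mode("overwrite").parquet(path)
    val df = spark.read.parquet(path).cache()
    assert(df.count() === 1000)

    spark.range(10).write.mode("overwrite").parquet(path)
    // A fresh read of the overwritten path should see the new files,
    // not the stale cached plan.
    assert(spark.read.parquet(path).count() === 10)
  }
}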

@sameeragarwal (Member Author)

@yhuai @mengxr what are your thoughts on this approach?

spark.range(1000).write.mode("overwrite").parquet(path)
val df = sqlContext.read.parquet(path).cache()
assert(df.count() == 1000)
sqlContext.range(10).write.mode("overwrite").parquet(path)
Member (review comment on the snippet above):

sqlContext -> spark

@SparkQA commented May 31, 2016

Test build #59668 has finished for PR 13419 at commit ee631d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Hi, @sameeragarwal.
Is there any reason to use SQLContext instead of SparkSession in this PR?

@sameeragarwal (Member Author)

@dongjoon-hyun no reason; old habits. I'll fix this. Thanks! :)

@mengxr (Contributor) commented May 31, 2016

I would prefer refreshing the dataset every time it is reloaded, while keeping existing cached datasets unchanged.

val df1 = sqlContext.read.parquet(dir).cache()
df1.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
val df2 = sqlContext.read.parquet(dir)
df2.count() // outputs 10
df1.count() // still outputs 1000 because it was cached

Neither approach is perfectly safe, so I don't have a strong preference either way.

sameeragarwal changed the title from "[SPARK-15678][SQL] Drop cache on appends and overwrites" to "[SPARK-15678][SQL] Not use cache on appends and overwrites" on Jun 1, 2016
@sameeragarwal (Member Author)

@mengxr it seems like overwriting generates new files, so we can achieve the same semantics without introducing an additional timestamp. The current solution should respect the contract for old DataFrames while making sure that new ones don't use the cached value. Let me know what you think.

@sameeragarwal (Member Author)

Also cc'ing @davies

@SparkQA commented Jun 1, 2016

Test build #59706 has finished for PR 13419 at commit a21013a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil (Contributor)

I guess the caching is done across multiple nodes. If the underlying data for a dataset is physically updated and some of the nodes where the data was cached go down, would the existing cached dataset be invalidated and refreshed? If not, old DataFrames could return inconsistent or incomplete data.

@sameeragarwal (Member Author)

@tejasapatil if the nodes where the data was cached go down, the CacheManager should still consider that data as cached. In that case, the next time the data is accessed, the underlying RDD will be recomputed and cached again.

@sameeragarwal (Member Author)

I ended up creating a small design doc describing the problem and presenting 2 possible solutions at https://docs.google.com/document/d/1h5SzfC5UsvIrRpeLNDKSMKrKJvohkkccFlXo-GBAwQQ/edit?ts=574f717f#. Based on this, we decided in favor of option 2 (#13566) as it is a less intrusive change to the default behavior. I'm going to close this PR for now, but we may revisit this approach (i.e., option 1) for 2.1.
