[SPARK-24596][SQL] Non-cascading Cache Invalidation #21594
Conversation
Test build #92105 has finished for PR 21594 at commit

Test build #92107 has finished for PR 21594 at commit

Test build #92108 has finished for PR 21594 at commit
IMHO it is good, but it may confuse users. Could you please add some JavaDocs to explain the difference?
  }

  /**
   * Un-cache all the cache entries that refer to the given plan.
We should update this document.
  cd.cachedRepresentation.cacheBuilder.clearCache(blocking)
} else {
  val plan = spark.sessionState.executePlan(cd.plan).executedPlan
  val newCache = InMemoryRelation(
hmm, if the plan to uncache is iterated after a plan containing it, doesn't this still use its cached plan?
Yes, you are right, although it wouldn't lead to any error, just like all the other compiled DataFrames that refer to this old InMemoryRelation. I'll change this piece of code. But you've brought up another interesting question. Consider a scenario similar to what you've mentioned:

df2 = df1.filter(...)
df2.cache()
df1.cache()
df1.collect()

Here we cache the dependent Dataset first and the Dataset being depended upon next. Optimally, when you call df2.collect(), you would like df2 to use the cached data in df1, but it doesn't work like that now, since df2's execution plan had already been generated before we called df1.cache(). It might be worth revisiting the caches and updating their plans if necessary when cacheQuery() is called.
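The ordering issue described above can be sketched with a toy model (hypothetical names, not Spark internals): a plan "compiled" at a given moment only captures the caches that existed at that moment, so caches added afterwards are invisible to it unless the plan is re-compiled.

```scala
// Toy model of compile-time cache capture (hypothetical, not Spark's code).
// A compiled plan records which of the caches available *at compile time*
// it can read from; caches created later are never picked up.
case class CompiledPlan(name: String, usedCaches: Set[String])

def compile(name: String, refersTo: Set[String], cachesNow: Set[String]): CompiledPlan =
  CompiledPlan(name, refersTo.intersect(cachesNow))

// df2 refers to df1, but df1 was not yet cached when df2's plan was compiled:
val df2Early = compile("df2", refersTo = Set("df1"), cachesNow = Set.empty)
// Re-compiling after df1.cache() would let df2 read df1's cached data:
val df2Late = compile("df2", refersTo = Set("df1"), cachesNow = Set("df1"))
```

In this toy model, `df2Early` ends up with no usable caches while `df2Late` would read from df1's cache, which mirrors why re-visiting cached plans at cacheQuery() time could help.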
@TomaszGaweda @viirya Nice suggestion about the doc. I'll update it.

Test build #92145 has finished for PR 21594 at commit
   */
  def unpersist(blocking: Boolean): this.type = {
-   sparkSession.sharedState.cacheManager.uncacheQuery(this, blocking)
+   sparkSession.sharedState.cacheManager.uncacheQuery(this, false, blocking)
nit: it's clearer to write cascade = false
- def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
-   uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+ def uncacheQuery(query: Dataset[_],
+     cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
nit:

def f(
    param1: X,
    param2: Y) ...

4 space indentation.
   */
- def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+ def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+     cascade: Boolean, blocking: Boolean): Unit = writeLock {
ditto
- def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+ def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+     cascade: Boolean, blocking: Boolean): Unit = writeLock {
+   val condition: LogicalPlan => Boolean =
condition -> shouldUncache?
Test build #92185 has finished for PR 21594 at commit

retest please
  }

  test("SPARK-24596 Non-cascading Cache Invalidation - uncache temporary view") {
    withView("t1", "t2") {
withTempView
Yes, good catch! A mistake caused by copy-paste.
  }

  test("SPARK-24596 Non-cascading Cache Invalidation - drop temporary view") {
    withView("t1", "t2") {
ditto
val df5 = df.agg(sum('a)).filter($"sum(a)" > 1)
assertCached(df5)
// first time use, load cache
df5.collect()
how do we prove this takes more than 5 seconds?
We just need to prove that the new InMemoryRelation works for building the cache (since the plan has been re-compiled)... maybe we should check the result, though. Also, I deliberately made this DataFrame not depend on the UDF so it can finish quickly.
LGTM except some comments about the tests.

Test build #92194 has finished for PR 21594 at commit
gatorsmile left a comment:
LGTM except a few minor comments.
   */
- def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+ def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+     cascade: Boolean, blocking: Boolean): Unit = writeLock {
indent.
- def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
-   uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+ def uncacheQuery(query: Dataset[_],
+     cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
indent
  // Also try to drop the contents of the table from the columnar cache
  try {
-   spark.sharedState.cacheManager.uncacheQuery(spark.table(table.identifier))
+   spark.sharedState.cacheManager.uncacheQuery(spark.table(table.identifier), true)
Use a named argument: cascade = true
      needToRecache += cd.copy(cachedRepresentation = newCache)
    }
  }
  needToRecache.foreach(cachedData.add)
create a private function from line 144 and line 158?
It's almost the same logic as "recache", except that it tries to reuse the cached buffer here. It would be nice to integrate these two, but it wouldn't look so clean given the inconvenience of copying a CacheBuilder. I'll try, though.
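The buffer-reuse idea discussed here can be sketched with a toy model (hypothetical types, not Spark's actual CachedData or CacheBuilder): on non-cascading invalidation, a dependent entry receives a freshly compiled plan while keeping its already-materialized buffer, so nothing is recomputed.

```scala
// Toy model (not Spark's real types): an entry pairs a compiled plan
// with its materialized buffer of cached rows.
case class Entry(plan: String, buffer: Vector[Int])

// Swap in a newly compiled plan but keep the existing buffer untouched.
def refresh(entry: Entry, recompile: String => String): Entry =
  entry.copy(plan = recompile(entry.plan))

val refreshed = refresh(Entry("oldPlan", Vector(1, 2, 3)), _ => "newPlan")
```

The buffer travels unchanged through the copy, which is the property that makes the non-cascading path cheap compared with a full recache.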
   */
  def unpersist(blocking: Boolean): this.type = {
-   sparkSession.sharedState.cacheManager.uncacheQuery(this, blocking)
+   sparkSession.sharedState.cacheManager.uncacheQuery(this, cascade = false, blocking)
Also update the comments on line 2966 and line 2979 and explain the new behavior.
override def run(sparkSession: SparkSession): Seq[Row] = {
  val catalog = sparkSession.sessionState.catalog
  val isTempTable = catalog.isTemporaryTable(tableName)
rename it to isTempView
  try {
-   sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
+   sparkSession.sharedState.cacheManager.uncacheQuery(
+     sparkSession.table(tableName), !isTempTable)
cascade = !isTempTable
- sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
+ val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
+ sparkSession.sharedState.cacheManager.uncacheQuery(
+   sparkSession.table(tableName), !sessionCatalog.isTemporaryTable(tableIdent))
val cascade = !sessionCatalog.isTemporaryTable(tableIdent)
...
  if (isCached(table)) {
    // Uncache the logicalPlan.
-   sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
+   sparkSession.sharedState.cacheManager.uncacheQuery(table, true, blocking = true)
the same here.
document the behavior changes in the

@maryannxue Thanks for fixing the current behavior! This is a very important fix.

Test build #92265 has finished for PR 21594 at commit

retest this please

Test build #92274 has finished for PR 21594 at commit

LGTM

Test build #92286 has finished for PR 21594 at commit

Test build #92287 has finished for PR 21594 at commit

retest this please

Test build #92290 has finished for PR 21594 at commit

Thanks! Merged to master.
What changes were proposed in this pull request?
This PR introduces a non-cascading cache invalidation mode and applies it as follows:

- DataSet.unpersist(): non-cascading mode
- Catalog.uncacheTable(): follow the same convention as dropping tables/views, i.e., use non-cascading mode for temporary views and regular (cascading) mode for the rest

Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not), any query referring to that view should no longer be valid. Hence, if a cached persistent view is dropped, we need to invalidate all dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed DataSet, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees consistent uncaching behavior between temporary views and unnamed DataSets.
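The two invalidation modes can be illustrated with a minimal sketch, with plan dependencies modeled as a plain map (hypothetical code, not Spark's CacheManager):

```scala
// Toy dependency graph: df2 was derived from df1.
val dependsOn: Map[String, Set[String]] =
  Map("df1" -> Set.empty, "df2" -> Set("df1"))

// cascade = true drops the entry and every entry whose plan refers to it;
// cascade = false drops only the entry itself, leaving dependents cached.
def uncache(cached: Set[String], plan: String, cascade: Boolean): Set[String] =
  if (cascade) cached.filterNot(p => p == plan || dependsOn(p).contains(plan))
  else cached - plan

// Dataset.unpersist / dropping a temporary view: df2's cache survives.
val nonCascading = uncache(Set("df1", "df2"), "df1", cascade = false)
// Dropping a table or persistent view: df2's cache is invalidated too.
val cascading = uncache(Set("df1", "df2"), "df1", cascade = true)
```

In this sketch, non-cascading invalidation of df1 leaves df2's cache intact (matching the temporary-view/unnamed-DataSet behavior), while cascading invalidation removes both entries (matching the drop of a persistent view or table).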
How was this patch tested?
New tests in CachedTableSuite and DatasetCacheSuite.