
Conversation

@maryannxue (Contributor) commented Jun 19, 2018

What changes were proposed in this pull request?

  1. Add a 'cascade' parameter to CacheManager.uncacheQuery(). In 'cascade=false' mode, only the current cache is invalidated; dependent caches have their execution plans rebuilt while reusing the already-cached buffer.
  2. Pass true/false from callers in the different uncache scenarios:
  • Drop tables and regular (persistent) views: regular mode
  • Drop temporary views: non-cascading mode
  • Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
  • Call Dataset.unpersist(): non-cascading mode
  • Call Catalog.uncacheTable(): follow the same convention as dropping tables/views, that is, use non-cascading mode for temporary views and regular mode for the rest

Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not), any query referring to that view should no longer be valid. Hence, if a cached persistent view is dropped, we need to invalidate all the dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed Dataset, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees consistent uncaching behavior between temporary views and unnamed Datasets.
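The two modes can be illustrated with a minimal, self-contained sketch. This is not Spark's CacheManager; CacheEntry, uncache, and the rebuilt flag are hypothetical stand-ins for the behavior described above:

```scala
// Hypothetical model of the two uncache modes; not Spark's actual API.
// `rebuilt = true` stands for "execution plan re-compiled, cached buffer reused".
case class CacheEntry(name: String, dependsOn: Set[String], rebuilt: Boolean = false)

object CascadeDemo {
  def uncache(entries: List[CacheEntry], target: String, cascade: Boolean): List[CacheEntry] =
    if (cascade) {
      // Regular mode: invalidate the target and every cache depending on it.
      entries.filterNot(e => e.name == target || e.dependsOn.contains(target))
    } else {
      // Non-cascading mode: invalidate only the target; dependent caches are
      // kept, with their plans marked for re-compilation.
      entries.filterNot(_.name == target)
        .map(e => if (e.dependsOn.contains(target)) e.copy(rebuilt = true) else e)
    }

  def main(args: Array[String]): Unit = {
    val caches = List(CacheEntry("view", Set.empty), CacheEntry("query", Set("view")))
    println(uncache(caches, "view", cascade = true))   // both entries invalidated
    println(uncache(caches, "view", cascade = false))  // "query" survives, rebuilt
  }
}
```

In these terms, dropping a persistent view corresponds to cascade = true (dependent caches become invalid), while dropping a temporary view or calling Dataset.unpersist() corresponds to cascade = false.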

How was this patch tested?

New tests in CachedTableSuite and DatasetCacheSuite.

@maryannxue (Contributor Author)

cc @gatorsmile @cloud-fan

@SparkQA commented Jun 20, 2018

Test build #92105 has finished for PR 21594 at commit 71b93ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 20, 2018

Test build #92107 has finished for PR 21594 at commit b9f1507.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 20, 2018

Test build #92108 has finished for PR 21594 at commit 4171062.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@TomaszGaweda

IMHO it is good, but may confuse users. Could you please add some JavaDocs to explain the difference?

}

/**
* Un-cache all the cache entries that refer to the given plan.
Member

We should update this document.

cd.cachedRepresentation.cacheBuilder.clearCache(blocking)
} else {
val plan = spark.sessionState.executePlan(cd.plan).executedPlan
val newCache = InMemoryRelation(
Member

hmm, if the plan to uncache is iterated after a plan containing it, doesn't this still use its cached plan?

Contributor Author

Yes, you are right, although it wouldn't lead to any error, just like all other compiled DataFrames that refer to this old InMemoryRelation. I'll change this piece of code. But you've brought up another interesting question:
A scenario similar to what you've mentioned:

df2 = df1.filter(...)
df2.cache()
df1.cache()
df1.collect()

This means we cache the dependent cache first and the cache being depended upon next. Optimally, when you do df2.collect(), you would like df2 to use the cached data of df1, but it doesn't work like this now, since df2's execution plan has already been generated before we call df1.cache(). It might be worth revisiting the caches and updating their plans if necessary when we call cacheQuery().
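The timing issue described here can be modeled without Spark at all. In this hedged sketch, compile is a hypothetical stand-in that snapshots which upstream caches exist at plan-compilation time:

```scala
// Hypothetical model of plan-compilation timing; not Spark's actual API.
object PlanTiming {
  // A compiled plan can only use caches that existed when it was compiled.
  case class Compiled(usesCaches: Set[String])

  def compile(deps: Set[String], cachedNow: Set[String]): Compiled =
    Compiled(deps.intersect(cachedNow))

  def main(args: Array[String]): Unit = {
    var cached = Set("df2")                       // df2.cache() happens first
    val df2Plan = compile(Set("df1"), cached)     // compiled before df1.cache()
    cached += "df1"                               // df1.cache() comes too late
    println(df2Plan.usesCaches)                   // empty: df1's cache is missed
    println(compile(Set("df1"), cached).usesCaches) // re-compiling would pick it up
  }
}
```

Revisiting the cached plans on cacheQuery(), as suggested above, would amount to re-running compile once the upstream cache exists.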

@maryannxue (Contributor Author)

@TomaszGaweda @viirya Nice suggestion about the doc. I'll update it.

@SparkQA commented Jun 20, 2018

Test build #92145 has finished for PR 21594 at commit c3a8e92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  */
  def unpersist(blocking: Boolean): this.type = {
-   sparkSession.sharedState.cacheManager.uncacheQuery(this, blocking)
+   sparkSession.sharedState.cacheManager.uncacheQuery(this, false, blocking)
Contributor

nit: it's clearer to write cascade = false

- def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
-   uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+ def uncacheQuery(query: Dataset[_],
+     cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
Contributor

nit

def f(
    param1: X,
    param2: Y)....

4 space indentation.

  */
- def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+ def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+     cascade: Boolean, blocking: Boolean): Unit = writeLock {
Contributor

ditto

- def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+ def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+     cascade: Boolean, blocking: Boolean): Unit = writeLock {
    val condition: LogicalPlan => Boolean =
@cloud-fan (Contributor) commented Jun 21, 2018

condition -> shouldUncache?

@SparkQA commented Jun 21, 2018

Test build #92185 has finished for PR 21594 at commit d97149d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ForeachBatchFunction(object):
  • case class ArrayDistinct(child: Expression)
  • class ForeachBatchSink[T](batchWriter: (Dataset[T], Long) => Unit, encoder: ExpressionEncoder[T])
  • trait PythonForeachBatchFunction

@maryannxue (Contributor Author)

retest please

}

test("SPARK-24596 Non-cascading Cache Invalidation - uncache temporary view") {
withView("t1", "t2") {
Contributor

withTempView

Contributor Author

Yes.. good catch! A mistake caused by copy-paste.

}

test("SPARK-24596 Non-cascading Cache Invalidation - drop temporary view") {
withView("t1", "t2") {
Contributor

ditto

val df5 = df.agg(sum('a)).filter($"sum(a)" > 1)
assertCached(df5)
// first time use, load cache
df5.collect()
Contributor

how do we prove this takes more than 5 seconds?

Contributor Author

We just need to prove the new InMemoryRelation works all right for building the cache (since the plan has been re-compiled)... maybe we should check the result, though. Plus, I deliberately made this DataFrame not dependent on the UDF so it can finish quickly.

@cloud-fan (Contributor)

LGTM except some comments about test

@SparkQA commented Jun 22, 2018

Test build #92194 has finished for PR 21594 at commit 2f00f2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) left a comment

LGTM except a few minor comments.

  */
- def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+ def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+     cascade: Boolean, blocking: Boolean): Unit = writeLock {
Member

indent.

- def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
-   uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+ def uncacheQuery(query: Dataset[_],
+     cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
Member

indent

  // Also try to drop the contents of the table from the columnar cache
  try {
-   spark.sharedState.cacheManager.uncacheQuery(spark.table(table.identifier))
+   spark.sharedState.cacheManager.uncacheQuery(spark.table(table.identifier), true)
Member

named argument. cascade = true

needToRecache += cd.copy(cachedRepresentation = newCache)
}
}
needToRecache.foreach(cachedData.add)
Member

create a private function from line 144 and line 158?

Contributor Author

It's almost the same logic as "recache", except that it tries to reuse the cached buffer here. It would be nice to integrate these two, but it wouldn't look so clean given the inconvenience of copying a CacheBuilder. I'll try, though.
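One possible shape for such a shared helper, sketched with simplified stand-in types (CachedData and recacheByCondition here are placeholders, not Spark's internal classes, and buffer reuse is reduced to an Option):

```scala
// Illustrative sketch of a shared recache helper; types are stand-ins.
case class CachedData(plan: String, buffer: Option[String])

object RecacheSketch {
  // Re-derive the plan of every matching entry; optionally keep the
  // already-materialized buffer instead of recomputing it from scratch.
  def recacheByCondition(
      entries: List[CachedData],
      matches: String => Boolean,
      reuseBuffer: Boolean): List[CachedData] =
    entries.map { cd =>
      if (matches(cd.plan)) {
        CachedData(plan = cd.plan + " (recompiled)",
          buffer = if (reuseBuffer) cd.buffer else None)
      } else cd
    }
}
```

In this sketch, the regular recache path would call the helper with reuseBuffer = false, and the non-cascading path with reuseBuffer = true.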

  */
  def unpersist(blocking: Boolean): this.type = {
-   sparkSession.sharedState.cacheManager.uncacheQuery(this, blocking)
+   sparkSession.sharedState.cacheManager.uncacheQuery(this, cascade = false, blocking)
Member

Also update the comment of line 2966 and line 2979 and explain the new behavior


override def run(sparkSession: SparkSession): Seq[Row] = {
val catalog = sparkSession.sessionState.catalog
val isTempTable = catalog.isTemporaryTable(tableName)
Member

rename it to isTempView

  try {
-   sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
+   sparkSession.sharedState.cacheManager.uncacheQuery(
+     sparkSession.table(tableName), !isTempTable)
Member

cascade = !isTempTable

- sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
+ val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
+ sparkSession.sharedState.cacheManager.uncacheQuery(
+   sparkSession.table(tableName), !sessionCatalog.isTemporaryTable(tableIdent))
Member

val cascade = !sessionCatalog.isTemporaryTable(tableIdent)

...

  if (isCached(table)) {
    // Uncache the logicalPlan.
-   sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
+   sparkSession.sharedState.cacheManager.uncacheQuery(table, true, blocking = true)
Member

the same here.

@gatorsmile (Member)

document the behavior changes in the # Migration Guide of /docs/sql-programming-guide.md

@gatorsmile (Member) commented Jun 22, 2018

@maryannxue Thanks for fixing the current behavior! This is a very important fix.

@SparkQA commented Jun 24, 2018

Test build #92265 has finished for PR 21594 at commit bf42fdf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

retest this please

@SparkQA commented Jun 25, 2018

Test build #92274 has finished for PR 21594 at commit bf42fdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

@SparkQA commented Jun 25, 2018

Test build #92286 has finished for PR 21594 at commit f7d48e5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 25, 2018

Test build #92287 has finished for PR 21594 at commit bfda8c1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

retest this please

@SparkQA commented Jun 25, 2018

Test build #92290 has finished for PR 21594 at commit bfda8c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

Thanks! Merged to master.
