Conversation

@imback82
Contributor

What changes were proposed in this pull request?

This PR proposes to update CACHE TABLE to use a LogicalPlan when caching a query, to avoid creating a DataFrame, as suggested here: #30743 (comment)

For reference, UNCACHE TABLE also uses a LogicalPlan:

case class UncacheTableExec(
    relation: LogicalPlan,
    cascade: Boolean) extends V2CommandExec {

  override def run(): Seq[InternalRow] = {
    val sparkSession = sqlContext.sparkSession
    sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession, relation, cascade)
    Seq.empty
  }
}

Why are the changes needed?

To avoid creating an unnecessary DataFrame, and to make CACHE TABLE consistent with the uncacheQuery call used in UNCACHE TABLE.
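The shape of the refactoring can be sketched with simplified stand-ins (`Plan`, `Frame`, and `MiniCacheManager` below are hypothetical, not Spark's actual classes): the cache accepts the logical plan directly, instead of requiring the caller to first wrap it in a DataFrame-like object.

```scala
// Illustrative sketch only; these are hypothetical stand-ins, not Spark's API.
final case class Plan(sql: String)
final case class Frame(plan: Plan) // analogous to wrapping a plan in a DataFrame

class MiniCacheManager {
  private var cached = Map.empty[Plan, String]

  // Before: the caller had to build a Frame just to hand over its plan.
  def cacheFrame(frame: Frame, name: String): Unit = cached += (frame.plan -> name)

  // After: accept the plan directly, so no intermediate Frame is created.
  def cachePlan(plan: Plan, name: String): Unit = cached += (plan -> name)

  def lookup(plan: Plan): Option[String] = cached.get(plan)
}
```

Either path keys the cache off the plan; the change simply drops the intermediate wrapper on the CACHE TABLE code path.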

Does this PR introduce any user-facing change?

No, just internal changes.

How was this patch tested?

Existing tests, since this is an internal refactoring change.

@github-actions github-actions bot added the SQL label Dec 17, 2020
@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37524/

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37524/

@imback82
Contributor Author

cc @cloud-fan, thanks!

* Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
* recomputing the in-memory columnar representation of the underlying table is expensive.
*/
def cacheQueryWithLogicalPlan(
Contributor

shall we just call it cacheQuery to be consistent with uncacheQuery?

Contributor Author

Unfortunately, we cannot because Scala doesn't allow overloading methods with default arguments.

Contributor Author

Ah, I think we can do the following:

def cacheQuery(
    spark: SparkSession,
    planToCache: LogicalPlan,
    tableName: Option[String]): Unit = {
  cacheQuery(spark, planToCache, tableName, MEMORY_AND_DISK)
}

def cacheQuery(
    spark: SparkSession,
    planToCache: LogicalPlan,
    tableName: Option[String],
    storageLevel: StorageLevel)
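The pattern above can be shown with a minimal, self-contained sketch (the `CacheApi` object and its `String` parameters are simplified stand-ins, not Spark's actual types): Scala rejects multiple overloaded alternatives that each declare default arguments, so the short overload carries no defaults and simply forwards to the full one.

```scala
// Hypothetical stand-in demonstrating overloads without default arguments.
object CacheApi {
  // Short overload: no default arguments, forwards the common case.
  def cacheQuery(plan: String, tableName: Option[String]): String =
    cacheQuery(plan, tableName, storageLevel = "MEMORY_AND_DISK")

  // Full overload: the single place that does the work.
  def cacheQuery(plan: String, tableName: Option[String], storageLevel: String): String =
    s"cached '$plan' as ${tableName.getOrElse("<unnamed>")} at $storageLevel"
}
```

For example, `CacheApi.cacheQuery("SELECT 1", Some("t"))` returns `cached 'SELECT 1' as t at MEMORY_AND_DISK`. Had both overloads declared defaults, the compiler would reject them as ambiguous alternatives.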

@cloud-fan cloud-fan left a comment (Contributor)

LGTM

def cacheQueryWithLogicalPlan(
    spark: SparkSession,
    planToCache: LogicalPlan,
    tableName: Option[String] = None,
Contributor

does this method need default parameter values?

Contributor Author

I think https://github.com/apache/spark/pull/30815/files#r544822268 should work. Let me update this PR. Thanks!

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37537/

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37537/

@SparkQA

SparkQA commented Dec 17, 2020

Test build #132921 has finished for PR 30815 at commit d9dcfb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 17, 2020

Test build #132934 has finished for PR 30815 at commit 52de954.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37561/

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37561/

@dongjoon-hyun
Member

cc @sunchao

* Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
* recomputing the in-memory columnar representation of the underlying table is expensive.
*/
def cacheQuery(
Member

+1 on this! I think we may replace one more usage in DataSourceV2Strategy.invalidateCache as well.

Contributor Author

thanks, updated.

    spark: SparkSession,
    planToCache: LogicalPlan,
    tableName: Option[String],
    storageLevel: StorageLevel): Unit = {
Member

perhaps we can just keep a single method with default value of tableName being None and storageLevel being MEMORY_AND_DISK?

Contributor

The Scala compiler will complain if we do that. See #30815 (comment)

Member

Ah got it. Thanks.

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37567/

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37567/

@SparkQA

SparkQA commented Dec 17, 2020

Test build #132959 has finished for PR 30815 at commit 52de954.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2020

Test build #132964 has finished for PR 30815 at commit bc80b6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 0f1a183 Dec 18, 2020