[SPARK-3645][SQL] Makes table caching eager by default and adds syntax for lazy caching #2513

liancheng · 2014-09-24T00:50:47Z

Although lazy caching for in-memory tables seems consistent with the RDD.cache() API, it is confusing for users who mainly work with SQL and are not familiar with Spark internals. The CACHE TABLE t; SELECT COUNT(*) FROM t; pattern is also commonly used just to force materialization and ensure predictable performance.

This PR makes both the CACHE TABLE t [AS SELECT ...] statement and the SQLContext.cacheTable() API eager by default, and adds a new CACHE LAZY TABLE t [AS SELECT ...] syntax to provide lazy in-memory table caching.

I also took the chance to do some refactoring: CacheCommand and CacheTableAsSelectCommand are now merged and renamed to CacheTableCommand, since the former is strictly a special case of the latter. A new UncacheTableCommand is added for the UNCACHE TABLE t statement.
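To make the eager/lazy distinction concrete, here is a minimal, Spark-free sketch of the two behaviors. ToyCachedTable, ToyCachingDemo, and all of their members are hypothetical names invented for illustration. This is not the code in this patch, only a toy model of when materialization happens.

```scala
// Toy model only: ToyCachedTable is a hypothetical class, not Spark code.
// `compute` stands in for scanning the underlying table.
final class ToyCachedTable[A](compute: () => Seq[A], isLazy: Boolean) {
  private var cached: Option[Seq[A]] = None

  // Eager caching (the new default): materialize as soon as the table is cached.
  if (!isLazy) cached = Some(compute())

  def isMaterialized: Boolean = cached.isDefined

  // Lazy caching: the first read pays the materialization cost.
  def rows: Seq[A] = {
    if (cached.isEmpty) cached = Some(compute())
    cached.get
  }
}

object ToyCachingDemo {
  def main(args: Array[String]): Unit = {
    var scans = 0
    val scan = () => { scans += 1; Seq(1, 2, 3) }

    // Analogous to CACHE TABLE t: the scan runs immediately.
    val eager = new ToyCachedTable(scan, isLazy = false)
    assert(eager.isMaterialized && scans == 1)

    // Analogous to CACHE LAZY TABLE t: nothing runs until the first read.
    val lazyTable = new ToyCachedTable(scan, isLazy = true)
    assert(!lazyTable.isMaterialized && scans == 1)

    lazyTable.rows // first read triggers the scan
    lazyTable.rows // second read is served from the cache
    assert(lazyTable.isMaterialized && scans == 2)
  }
}
```

With eager caching, the extra SELECT COUNT(*) FROM t query is no longer needed to force materialization. The lazy variant keeps the old behavior for users who prefer to defer the cost.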

SparkQA · 2014-09-24T00:54:26Z

QA tests have started for PR 2513 at commit b72e24e.

This patch merges cleanly.

SparkQA · 2014-09-24T01:47:33Z

QA tests have finished for PR 2513 at commit b72e24e.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
- case class UncacheTableCommand(tableName: String) extends Command
- case class CacheTableCommand(tableName: String, logicalPlan: Option[LogicalPlan], isLazy: Boolean)
- case class UncacheCommand(tableName: String) extends LeafNode with Command
- case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

SparkQA · 2014-09-24T01:47:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20735/

SparkQA · 2014-09-24T15:44:20Z

QA tests have started for PR 2513 at commit 8d2192d.

This patch merges cleanly.

SparkQA · 2014-09-24T16:37:06Z

QA tests have finished for PR 2513 at commit 8d2192d.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
- case class UncacheTableCommand(tableName: String) extends Command
- case class CacheTableCommand(tableName: String, logicalPlan: Option[LogicalPlan], isLazy: Boolean)
- case class UncacheCommand(tableName: String) extends LeafNode with Command
- case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

SparkQA · 2014-09-24T16:37:09Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20753/

marmbrus · 2014-10-01T19:52:49Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala

This is kind of a nit, but I talked to @aarondav and we are thinking that CACHE LAZY TABLE might be a little more consistent. The reasoning is that CACHE is really the most important verb here and so should go first. This is similar to INSERT INTO TABLE vs INSERT OVERWRITE TABLE.

Agreed. I wasn't very sure about the syntax either when I added this.

liancheng · 2014-10-02T14:02:30Z

Updated: replaced LAZY CACHE TABLE with CACHE LAZY TABLE, and also refactored the test suites to check in-memory column RDD materialization.

SparkQA · 2014-10-02T14:04:30Z

QA tests have started for PR 2513 at commit 39d243e.

This patch merges cleanly.

SparkQA · 2014-10-02T14:54:05Z

QA tests have finished for PR 2513 at commit 39d243e.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
- case class UncacheTableCommand(tableName: String) extends Command
- case class CacheTableCommand(tableName: String, logicalPlan: Option[LogicalPlan], isLazy: Boolean)
- case class UncacheCommand(tableName: String) extends LeafNode with Command
- case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

AmplabJenkins · 2014-10-02T14:54:07Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21201/

liancheng · 2014-10-04T11:36:06Z

Rebased onto master, with the new CACHE LAZY TABLE t syntax.

SparkQA · 2014-10-04T11:39:42Z

QA tests have started for PR 2513 at commit fe92287.

This patch merges cleanly.

SparkQA · 2014-10-04T12:32:10Z

QA tests have finished for PR 2513 at commit fe92287.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
- case class UncacheTableCommand(tableName: String) extends Command
- case class CacheTableCommand(
- case class UncacheTableCommand(tableName: String) extends LeafNode with Command
- case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

AmplabJenkins · 2014-10-04T12:32:13Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21292/

liancheng · 2014-10-04T14:42:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala

Added the keyword LAZY and sorted all the keywords in alphabetical order here. This list was once sorted, but the ordering was broken later.

marmbrus · 2014-10-06T00:53:46Z

I'm going to merge this. Feel free to clean up minor ";" issue as part of the other parser refactoring you are doing. Thanks :)

liancheng · 2014-10-07T01:49:53Z

sql/core/src/main/scala/org/apache/spark/sql/CacheManager.scala

@marmbrus Forgot to confirm this with you: the default value of the blocking argument in RDD.unpersist() is true, so I changed the default value here to keep the semantics consistent. This also makes testing easier (I added assertions to check RDD materialization, and non-blocking unpersisting introduces some subtleties). Did you intend to use non-blocking unpersisting here?

No, I mistakenly thought that was the default. We should match the original semantics.

marmbrus reviewed Oct 1, 2014

liancheng force-pushed the eager-caching branch from 8d2192d to 39d243e on October 2, 2014 14:01

Makes table caching eager by default and adds syntax for lazy caching

fe92287

liancheng force-pushed the eager-caching branch from 39d243e to fe92287 on October 4, 2014 11:34

liancheng reviewed Oct 4, 2014

asfgit closed this in 34b97a0 Oct 6, 2014

liancheng reviewed Oct 7, 2014
View reviewed changes

liancheng deleted the eager-caching branch October 9, 2014 05:57
